Post-training quantization of neural network through correlation maximization

Maria Pushkareva
Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS, Moscow, Russia
pushkareva.mariia@yandex.ru

Iakov Karandashev
Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS, Moscow, Russia
karandashev@niisi.ras.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract—In this paper, we propose a method for quantizing the weights of neural networks by maximizing the correlation between the initial and quantized weights, taking into account the distribution of the weight density in each layer. Quantization is performed after the neural network training, without further post-training. We tested the algorithm on the ImageNet dataset for the VGG-16, ResNet-50, and Xception neural networks [2]. For the ResNet-50 and Xception networks, 4-5 bits per weight of a layer are required to obtain acceptable Top-5 accuracy; for VGG-16, 3-4 bits per weight are sufficient.

Keywords—weights quantization, post-learning, linear quantization, exponential quantization

I. INTRODUCTION

The majority of the neural networks used for image recognition have many parameters that have to be stored. Consequently, a substantial memory capacity is necessary, and this requirement limits the applicability of such networks. For example, the storage requirements of the VGG-16 and ResNet152V2 networks are 528 MB [3] and 232 MB [4], respectively. Quantization and reduction of the number of weights are the basic approaches that allow us to decrease the memory needed to store the neural network weights.

Quantization is a reduction in the variety of distinct weight values in a layer. The most popular quantization methods are the use of fixed-point formats in place of floating-point formats [9], binarization [10], ternarization [11], the use of a logarithmic scale [12], and so on. The number of weights is usually reduced with such methods as pruning [13], weight sharing (including the convolution operation itself) [14], tensor decompositions [15], and so on. In the present paper, we explore the quantization of trained neural networks. The number B of bits per weight defines the number of distinct weight values, which is consequently equal to 2^B. We perform the quantization independently for each layer: in a given layer we split the whole range of weights, from the minimal to the maximal value, into 2^B intervals and then replace all the weights belonging to one interval by a single value. In what follows we examine the optimal choice of the interval boundaries as well as of the values with which the weights are replaced.

The authors of paper [7] discussed the optimal quantization problem for the Hopfield neural network. They showed that by maximizing the correlation between the initial and quantized values of the weights it is possible to minimize the errors of the quantized neural network. We believe that this result is correct, and in what follows we choose the interval boundaries and the quantized values of the weights inside the intervals proceeding from the maximal correlation principle.

Frequently, to obtain sufficient accuracy of the quantized neural network, one has to combine the quantization with post-training of the network, a procedure that requires substantial resources [5, 6]. In the present paper, we perform the quantization after the neural network training without any subsequent post-training, which allows us to reduce the quantization costs substantially.

II. ESTIMATE OF CORRELATION AND ITS GRADIENT

For each layer, let us quantize the weights inside the interval [w_min, w_max], where w_min and w_max are the minimal and maximal weight values in the given layer, respectively. (We normalize all the weights, so that w → (w − w̄)/σ_w.) Let B be the number of bits used to store the weights of one layer. Then n = 2^B is the number of quantized weight values, and we denote by x_i the boundaries of the intervals on which the quantized weights are constant. Consequently,

    w_{\min} = x_0 < x_1 < \ldots < x_{n-1} < x_n = w_{\max}.    (1)

Let w be the input weights and y_i the quantized value inside the interval (x_i, x_{i+1}). It is not evident how we should choose the interval boundaries x_i or the quantized values y_i inside each interval.
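For illustration, here is a minimal Python sketch of this setup: it normalizes the weights of one layer and builds the simplest admissible partition of Eq. (1), a uniform grid of n + 1 boundaries. The function name linear_partition and the toy Gaussian "layer" are assumptions made for the example; the actual choice of the boundaries is derived in the following sections.

import numpy as np

def linear_partition(weights, bits):
    """Normalize a layer's weights and build the uniform boundary set of Eq. (1)."""
    w = weights.flatten()
    w = (w - w.mean()) / w.std()        # per-layer normalization assumed in Section II
    n = 2 ** bits                       # number of quantization intervals
    # n + 1 boundaries: w_min = x_0 < x_1 < ... < x_n = w_max
    x = np.linspace(w.min(), w.max(), n + 1)
    return w, x

# toy "layer" of Gaussian weights quantized with B = 3 bits
rng = np.random.default_rng(0)
w, x = linear_partition(rng.normal(size=10_000), bits=3)
print(len(x))                           # 9 boundaries, i.e. 8 intervals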
In the present paper, we suppose that the stronger the correlation between the input and the quantized values, the smaller the error of the quantized neural network compared with the initial one:

    \rho(w, y) = \frac{\overline{wy}}{\sigma_w \sigma_y} \to \max,    (2)

where \overline{wy} is the covariance between the initial and the quantized values, and \sigma_w and \sigma_y are the standard deviations of the input weights and of their quantized values, respectively. For simplicity, we suppose that inside a layer the weight distribution is symmetric and the average weight value is equal to zero. As we show below, this assumption almost always holds for large deep neural networks.

We can estimate the covariance \overline{wy} between the input and the quantized values as

    \overline{wy} = \int_{-\infty}^{+\infty} w\, y(w)\, p(w)\, dw,    (3)

where p(w) is the density of the weight distribution inside the layer. Since y is constant inside each interval (x_i, x_{i+1}), the last equation takes the form

    \overline{wy} = \sum_{i=0}^{n-1} y_i \int_{x_i}^{x_{i+1}} w\, p(w)\, dw.    (4)

Similarly, the variance of the quantized values is

    \sigma_y^2 = \int_{-\infty}^{+\infty} y^2(w)\, p(w)\, dw = \sum_{i=0}^{n-1} y_i^2 \int_{x_i}^{x_{i+1}} p(w)\, dw.    (5)

If we introduce the additional notations

    c_i = \int_{x_i}^{x_{i+1}} w\, p(w)\, dw \quad \text{and} \quad p_i = \int_{x_i}^{x_{i+1}} p(w)\, dw,    (6)

we can simplify Eqs. (4)-(5) significantly:

    \overline{wy} = \sum_{i=0}^{n-1} y_i c_i, \qquad \sigma_y^2 = \sum_{i=0}^{n-1} y_i^2 p_i.    (7)

To maximize the correlation (2) we have two sets of parameters: the boundaries of the intervals x_i and the quantized values y_i inside the intervals. Let us for the moment suppose that the boundaries x_i are known. Then the optimization with respect to y_i gives

    y_i = c_i / p_i.    (8)
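The quantities of Eqs. (6)-(8) and the correlation (2) are easy to evaluate numerically for a fixed boundary set by replacing the integrals with averages over the layer weights. A minimal sketch follows; the helper name correlation_for_boundaries, the toy Gaussian weights, and the 3-bit uniform test partition are assumptions made for the example.

import numpy as np

def correlation_for_boundaries(w, x):
    """Empirical c_i, p_i (Eq. 6), optimal y_i = c_i / p_i (Eq. 8) and rho (Eq. 2)."""
    # index of the interval (x_i, x_{i+1}] each weight falls into
    idx = np.clip(np.searchsorted(x, w) - 1, 0, len(x) - 2)
    n = len(x) - 1
    p = np.array([np.mean(idx == i) for i in range(n)])                    # p_i
    c = np.array([np.mean(np.where(idx == i, w, 0.0)) for i in range(n)])  # c_i
    levels = np.where(p > 0, c / np.maximum(p, 1e-12), 0.0)                # y_i, Eq. (8)
    y = levels[idx]                            # quantized copy of the weights
    rho = np.corrcoef(w, y)[0, 1]              # correlation of Eq. (2)
    return c, p, levels, rho

# toy layer: 10,000 Gaussian weights, 3-bit uniform partition
rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
x = np.linspace(w.min(), w.max(), 2 ** 3 + 1)
_, _, levels, rho = correlation_for_boundaries(w, x)
print(round(rho, 4))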
When we substitute the optimal value (8) into Eq. (2), take into account formulas (7), and recall that \sigma_w = const (it depends on neither x_i nor y_i), we obtain the following optimization problem:

    \rho(w, y) = \frac{1}{\sigma_w} \sqrt{\sum_{i=0}^{n-1} \frac{c_i^2}{p_i}} \to \max.    (9)

We differentiate this expression with respect to x_i and obtain an expression for the gradient \nabla_i \rho:

    \nabla_i \rho = \frac{\partial \rho}{\partial x_i} = \frac{p(x_i)\,(y_i - y_{i-1})\,(y_i + y_{i-1} - 2 x_i)}{2\, \sigma_w \sigma_y}.    (10)

III. DESCRIPTION OF QUANTIZATION PROCEDURE

The obtained expression (10) for the gradient \nabla_i \rho allowed us to implement a fast correlation-maximization algorithm based on gradient ascent. In the course of its run, the algorithm adjusts the boundaries of the intervals x_i, which we use as the optimization parameters. For the given values of x_i, the quantized weights y_i are defined by Eq. (8).

As the density function p(w) we use its kernel estimate calculated on 10,000 weights drawn at random from the layer; when the layer contains fewer than 10,000 weights, all of them are taken into account. We replace the integral formulas of Eq. (6) by numerical estimates of c_i and p_i calculated over the real weights of the layer:

    c_i = \frac{1}{N_w} \sum_{w} w\, I(x_i < w \le x_{i+1}), \qquad p_i = \frac{1}{N_w} \sum_{w} I(x_i < w \le x_{i+1}).    (11)

Here N_w is the number of weights in the layer, and I(x_i < w \le x_{i+1}) is an indicator function that selects only the weights belonging to the interval (x_i, x_{i+1}].

To initialize the gradient ascent, it is necessary to choose an initial partition, that is, an initial set [w_min, x_1, x_2, ..., x_{n-1}, w_max]. In our simulations, we used the linear and exponential partitions from [8] as the initial sets and examined quantization of the pre-trained ResNet-50, Xception, and VGG-16 neural networks. We employed the Python programming language and the Keras framework [2].

As a result of the optimization, we obtained the optimal interval boundaries [w_min, x_1, x_2, ..., x_{n-1}, w_max] as well as the corresponding set of quantized weights [y_0, y_1, ..., y_{n-1}]. Then we used the quantized weights in place of the input weights without post-training of the neural networks. The Python code is given in Appendix B.
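A single boundary update of this gradient ascent can be sketched as follows. The histogram-based estimates play the role of Eq. (11), and the line marked below implements the gradient of Eq. (10) up to the constant factor 1/(σ_w σ_y). The step size lr, the re-sorting of the boundaries, and the number of iterations are assumptions of this sketch; Appendix B lists the routine actually used (it delegates the iteration to scipy.optimize.minimize).

import numpy as np
from scipy.stats import gaussian_kde

def ascent_step(w, x, kde, lr=1.0):
    """One gradient-ascent update of the interior boundaries x_1, ..., x_{n-1}.

    w   -- flat array of the layer's weights
    x   -- full boundary set [w_min, x_1, ..., x_{n-1}, w_max] of Eq. (1)
    kde -- gaussian_kde fitted on a sample of w (the estimate of p(w))
    lr  -- step size of the ascent (an assumed value)
    """
    counts, _ = np.histogram(w, bins=x)
    sums, _ = np.histogram(w, bins=x, weights=w)
    p = counts / len(w)                                            # p_i of Eq. (11)
    c = sums / len(w)                                              # c_i of Eq. (11)
    y = np.where(p > 0, c / np.maximum(p, 1e-12), 0.0)             # y_i of Eq. (8)
    px = kde.evaluate(x[1:-1])                                     # density p(x_i) at the boundaries
    # gradient of Eq. (10) up to the constant factor 1 / (sigma_w * sigma_y)
    g = px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2.0 * x[1:-1]) / 2.0
    x_new = x.copy()
    x_new[1:-1] = np.sort(x[1:-1] + lr * g)                        # keep the ordering of Eq. (1)
    return x_new

# illustrative run on a toy layer with a 4-bit linear initial partition
rng = np.random.default_rng(0)
w = rng.normal(size=20_000)
x = np.linspace(w.min(), w.max(), 2 ** 4 + 1)
kde = gaussian_kde(rng.choice(w, size=10_000, replace=False))
for _ in range(100):
    x = ascent_step(w, x, kde)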
IV. RESULTS

In Fig. 1, we show weight histograms for convolution and fully connected layers of the VGG-16 and ResNet-50 neural networks. As we mentioned before, the weight distributions in the layers of deep neural networks are nearly symmetric with respect to zero, and their average values are close to zero.

Fig. 1. Examples of weight histograms in convolution and fully connected layers for VGG-16 and ResNet-50.

In Tables I-III, we present the correlations between the prior weights and the weights after quantization, averaged over all layers, for the ResNet-50, Xception, and VGG-16 neural networks. We applied the linear and the exponential quantization, ran the correlation-maximization algorithm on top of each, and then calculated the average correlations.

TABLE I. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE RESNET-50 NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for ResNet-50
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8169   0.9376    0.9461   0.9659
4 bit    0.8473   0.9726    0.9822   0.9898
5 bit    0.9233   0.9890    0.9945   0.9972
6 bit    0.9739   0.9957    0.9983   0.9990
7 bit    0.9925   0.9980    0.9995   0.9993

TABLE II. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE XCEPTION NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for Xception
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8630   0.9552    0.9566   0.9735
4 bit    0.9072   0.9814    0.9855   0.9919
5 bit    0.9611   0.9926    0.9955   0.9972
6 bit    0.9880   0.9960    0.9986   0.9984
7 bit    0.9967   0.9972    0.9996   0.9986

TABLE III. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE VGG-16 NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for VGG-16
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8325   0.9342    0.9464   0.9669
4 bit    0.8469   0.9659    0.9816   0.9899
5 bit    0.8968   0.9862    0.9943   0.9973
6 bit    0.9579   0.9943    0.9983   0.9992
7 bit    0.9872   0.9980    0.9995   0.9995

Tables I-III show that when the number of bits is small (up to 5 bits, i.e. when the number of intervals is at most 32), the maximization algorithm does increase the correlation averaged over the layers. Linear quantization with subsequent maximization leads to growth of the average correlation in all the examined cases. Exponential quantization with subsequent maximization increases the average correlation only when the number of intervals is equal to 8, 16, or 32. When the number of bits is larger (that is, when the number of intervals is 64 or 128), the maximization algorithm sometimes fails: for some layers its results are worse than those of the exponential quantization.

It may be that the maximization algorithm does not always run correctly because the gradient-ascent step size lr has to be chosen more carefully. For this reason, for each layer we also used an algorithm that selects the quantization with the maximal correlation: when the correlation decreased after our optimization, we kept the initial exponential quantization. We call this way of quantizing a neural network the "best" algorithm.
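The per-layer selection rule of this "best" algorithm reduces to a few lines. In the sketch below, the callables optimize and correlation are placeholders standing for the correlation-maximization routine of Appendix B and for Eq. (2) evaluated on one layer; the function name is an assumption made for the example.

def best_quantization(w, x_exp, optimize, correlation):
    """Keep the optimized partition only if it really increases the correlation
    of Eq. (2); otherwise fall back to the initial exponential partition.

    w           -- flat array of one layer's weights
    x_exp       -- initial (exponential) boundary set for this layer
    optimize    -- callable x -> x_opt, the correlation-maximization routine
    correlation -- callable (w, x) -> rho, Eq. (2) for a given partition
    """
    x_opt = optimize(x_exp)
    failed = correlation(w, x_opt) < correlation(w, x_exp)
    # layers with failed == True contribute to the %fail column of Tables IVa-c
    return (x_exp if failed else x_opt), failed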
In Tables IVa-c, we give the Top-5 accuracies of the ResNet-50, VGG-16, and Xception neural networks quantized using the exponential scale (exp), by the correlation-maximization algorithm with the prior exponential splitting (opt), and with the aid of the algorithm that keeps the exponential quantization in the layers where the correlation maximization failed (max). The column %fail shows the percentage of layers for which ρ_max corr < ρ_exp.

TABLE IVa. TOP-5 ACCURACY FOR THE RESNET-50 NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND %FAIL IS THE FRACTION OF LAYERS WHERE OPTIMIZATION LEADS TO A WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION

ResNet-50
         %fail    exp     opt     max
3 bit    0%       0.03    0.02    0
4 bit    0%       0.71    0.82    0.79
5 bit    0%       0.91    0.93    0.93
6 bit    7%       0.94    0.94    0.94
7 bit    26%      0.94    0.94    0.94
32 bit (full precision)           0.94

TABLE IVb. TOP-5 ACCURACY FOR THE VGG-16 NEURAL NETWORK; NOTATION AS IN TABLE IVa

VGG-16
         %fail    exp     opt     max
3 bit    0%       0.69    0.73    0.76
4 bit    0%       0.87    0.91    0.91
5 bit    0%       0.89    0.9     0.93
6 bit    0%       0.9     0.93    0.93
7 bit    6%       0.9     0.94    0.94
32 bit (full precision)           0.94

TABLE IVc. TOP-5 ACCURACY FOR THE XCEPTION NEURAL NETWORK; NOTATION AS IN TABLE IVa

Xception
         %fail    exp     opt     max
3 bit    0%       0       0.02    0.01
4 bit    0%       0.43    0.65    0.61
5 bit    10%      0.89    0.86    0.89
6 bit    15%      0.89    0.9     0.9
7 bit    34%      0.92    0.86    0.92
32 bit (full precision)           0.92

The examined neural networks confirmed our hypothesis that the larger the correlation between the input and the quantized values, the better the accuracy of the quantized neural network. This holds, however, only if the correlation increases on every layer of the network: an increase of the correlation averaged over all layers does not by itself guarantee an increase of the accuracy.

In Tables IVa-c, we also compare the Top-5 accuracies of the networks quantized with the aid of our best algorithm (max) with the Top-5 accuracies of the initial networks (32 bit). For ResNet-50 and Xception, the Top-5 accuracy drop is 20-30% when 4 bits per weight are reserved and less than 3.5% when 5 bits are reserved. In the case of VGG-16, the Top-5 accuracy drop is about 20% with 3 bits per weight and less than 3% when more bits are reserved.

In Figs. 2 and 3, we show, for the Xception, ResNet-50, and VGG-16 neural networks, the dependence of the Top-5 accuracy on the number of bits reserved for the storage of each weight, for the linear and the exponential initial splitting, respectively. The model accuracy increases when we choose the quantization with the maximal correlation between the quantized and the input weights. The corresponding Top-1 accuracies are given in Appendix A.

Fig. 2. Top-5 accuracies for ResNet-50, Xception, and VGG-16: linear quantization and quantization with correlation maximization with prior linear splitting (max corr).

Fig. 3. Top-5 accuracies for ResNet-50, Xception, and VGG-16: exponential quantization and quantization with correlation maximization with prior exponential splitting (max corr).

V. CONCLUSIONS

We developed an algorithm for neural network quantization based on the maximization of the correlation between the quantized and the input weights. When this algorithm is used, no post-training of the neural network is necessary. Reserving 5 bits per weight in each layer, we quantized the VGG-16 neural network with only a 1% drop of the Top-5 accuracy. Under such compression, the memory required to store the weights is approximately 6 times smaller than with full-precision floats (32 bits). For comparison, in paper [5] the VGG-16 neural network was compressed about 2.5 times by quantization (the full compression of their algorithm is around 49 times); however, their neural network required post-training, for which substantial computing power was necessary. Comparing with the results of paper [8], we see that for the ResNet-50, VGG-16, and Xception neural networks our Top-1 and Top-5 accuracies are better at the same compression. Our approach requires only 3-4 bits per weight to achieve a Top-5 accuracy above 0.6 for different architectures without re-training.
APPENDIX A: TOP-1 ACCURACIES

Fig. 4. Top-1 accuracies for ResNet-50, Xception, and VGG-16: linear/exponential (lin/exp) quantization and quantization with correlation maximization with prior linear/exponential splitting (max corr). [Six panels: ResNet-50, Xception, and VGG-16 under linear and exponential initial splitting; y-axis: Top-1 accuracy; x-axis: 3-7 bits.]

ACKNOWLEDGMENT

The work was financially supported by the State Program of SRISA RAS No. 0065-2019-0003 (AAA-A19-119011590090-2).

REFERENCES

[1] ImageNet – huge image dataset [Online]. URL: http://www.image-net.org.
[2] Models for image classification with weights trained on ImageNet [Online]. URL: https://keras.io/applications/.
[3] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556.
[4] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385.
[5] S. Han, H. Mao and W.J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[6] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160.
[7] B.V. Kryzhanovsky, M.V. Kryzhanovsky and M.Yu. Malsagov, "Discretization of a matrix in quadratic functional binary optimization," Doklady Mathematics, vol. 83, pp. 413-417, 2011. DOI: 10.1134/S1064562411030197.
[8] M.Yu. Malsagov, E.M. Khayrov, M.M. Pushkareva and I.M. Karandashev, "Exponential discretization of weights of neural network connections in pre-trained neural networks," preprint, 2020.
[9] M. Courbariaux, Y. Bengio and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024.
[10] M. Courbariaux, Y. Bengio and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," Conference on Neural Information Processing Systems, arXiv preprint arXiv:1511.00363.
[11] Z. Lin, M. Courbariaux, R. Memisevic and Y. Bengio, "Neural networks with few multiplications," Proceedings of the International Conference on Learning Representations, arXiv preprint arXiv:1510.03009.
[12] E.H. Lee, D. Miyashita, E. Chai, B. Murmann and S.S. Wong, "LogNet: Energy-efficient neural networks using logarithmic computation," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[13] S. Han, J. Pool, J. Tran and W. Dally, "Learning both Weights and Connections for Efficient Neural Networks," arXiv preprint arXiv:1506.02626, 2015.
[14] W. Chen, J. Wilson, S. Tyree and K. Weinberger, "Compressing Neural Networks with the Hashing Trick," arXiv preprint arXiv:1504.04788, 2015.
[15] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets and V. Lempitsky, "Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition," 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, Conference Track Proceedings, 2015.
APPENDIX B: ALGORITHM

# accessory functions
import numpy as np
from scipy.optimize import minimize


def f(x, func, kde, X, x_min, x_max):
    # objective: the (scaled) correlation for the current boundary set x
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    return cov


def grad(x, func, kde, X, x_min, x_max, alpha=10):
    # gradient of the objective with respect to the interior boundaries x, Eq. (10)
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    step = alpha * px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2 * x) / 2
    return step


def cov_kde(x0, kde, X, x_min, x_max):
    '''
    Calculate the density values, the quantized values and the objective on the set x0.
    X            -- weights in this layer
    x_min, x_max -- minimal and maximal weight values in the layer
    x0           -- current set of interior boundaries (variable values only)
    '''
    p = np.zeros(len(x0) + 1)   # number of weights in each interval (p_i * N_w)
    C = np.zeros(len(x0) + 1)   # sum of the weights in each interval (c_i * N_w)
    x_ext = sorted(np.append(x0, [x_min, x_max]))
    for i in range(len(x_ext) - 1):
        mask = np.logical_and(x_ext[i] < X, X <= x_ext[i + 1])
        p[i] = len(X[mask])
        C[i] = np.sum(X[mask])
        if p[i] == 0:           # empty interval: avoid division by zero
            C[i] = 0
            p[i] = 1
    y = C / p                                  # quantized values y_i = c_i / p_i, Eq. (8)
    px = kde.evaluate(x0)                      # kernel density estimate p(x_i) at the boundaries
    cov = np.linalg.norm(C / np.sqrt(p))       # sqrt(sum_i C_i^2 / p_i), proportional to rho of Eq. (9)
    return y, px, cov, p


def results(kde, w, x0, x_min, x_max, func, bits, kde_std, ans_case='CG'):
    '''
    Correlation-maximization procedure for the initial set x0 (variable values only).
    w            -- layer weights
    kde          -- kernel density estimation on a random sample of the weights
    x_min, x_max -- minimal and maximal weight values
    (kde_std is kept in the signature for compatibility and is not used here)
    '''
    n_d = 2 ** bits
    alpha = 10                  # scaling of the gradient
    tol_curr = 1e-4
    fx = lambda x: -f(x, func, kde, w, x_min, x_max)
    gradx = lambda x: -grad(x, func, kde, w, x_min, x_max, alpha)
    ans = minimize(fun=fx, x0=x0, jac=gradx, method=ans_case, tol=tol_curr)
    solutions = ans['x']
    correlations = -ans['fun']
    gradients = np.linalg.norm(gradx(ans['x'])) / alpha / n_d
    return solutions, correlations, gradients
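A possible per-layer driver for these routines is sketched below. Such driver code is not reproduced in the paper, so the sample size, the uniform initial partition, and the final mapping of the weights to their quantized values are assumptions of this sketch (in practice the weight array would come from a pre-trained Keras layer, e.g. via model.get_weights()); it assumes the definitions of Appendix B are in scope.

import numpy as np
from scipy.stats import gaussian_kde

# toy "layer"; a real run would take w from a pre-trained Keras model
rng = np.random.default_rng(0)
w = rng.normal(size=50_000)

bits = 4
w_min, w_max = w.min(), w.max()
sample = rng.choice(w, size=min(len(w), 10_000), replace=False)
kde = gaussian_kde(sample)                                  # estimate of p(w)

# initial partition: here a simple uniform grid of interior boundaries
x0 = np.linspace(w_min, w_max, 2 ** bits + 1)[1:-1]

boundaries, score, grad_norm = results(kde, w, x0, w_min, w_max,
                                       cov_kde, bits, kde_std=w.std())
# score is the objective sqrt(sum_i c_i^2 / p_i), proportional to rho of Eq. (9)

# quantize: replace every weight by the value y_i of its interval, Eq. (8)
y, _, _, _ = cov_kde(boundaries, kde, w, w_min, w_max)
edges = np.sort(np.concatenate(([w_min], boundaries, [w_max])))
idx = np.clip(np.searchsorted(edges, w) - 1, 0, len(y) - 1)
w_quantized = y[idx]
print(round(float(score), 4), round(float(np.corrcoef(w, w_quantized)[0, 1]), 4))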