Matrix Deep Neural Network and Its Rapid Learning in Data Science Tasks

Iryna Pliss (1), Olena Boiko (2), Valentyna Volkova (3), Yevgeniy Bodyanskiy (4)
1. Control Systems Research Laboratory, Kharkiv National University of Radio Electronics, UKRAINE, Kharkiv, Nauky ave., 14, email: iryna.pliss@nure.ua
2. Control Systems Research Laboratory, Kharkiv National University of Radio Electronics, UKRAINE, Kharkiv, Nauky ave., 14, email: olena.boiko@ukr.net
3. Samsung Electronics Ukraine Company, LLC R&D (SRK), UKRAINE, Kyiv, Lva Tolstogo St., 57, email: v.volkova@samsung.com
4. Control Systems Research Laboratory, Kharkiv National University of Radio Electronics, UKRAINE, Kharkiv, Nauky ave., 14, email: yevgeniy.bodyanskiy@nure.ua

ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic

Abstract: A matrix deep neural network and its learning algorithm are proposed. The system reduces the number of tunable weights by rejecting the vectorization-devectorization operations, and it preserves the information shared between the rows and columns of 2D inputs.

Keywords: deep learning, multilayer network, data mining, 2D network.

I. INTRODUCTION

Nowadays, artificial neural networks (ANNs) are widely used to solve many problems arising in Data Science, and the multilayer perceptron (MLP) [1, 13-18] is the most widely used among them. On the basis of the MLP, deep neural networks (DNNs) [2-4, 19, 21] were developed that have improved characteristics in comparison with their prototypes, the traditional shallow neural networks.

In the general case, a multilayer perceptron containing L information processing layers (L - 1 hidden layers and one output layer) realizes a nonlinear transformation that can be written in the form

  Y_hat(k) = Psi(X(k)) = Psi^[L]( W^[L](k-1) Psi^[L-1]( W^[L-1](k-1) Psi^[L-2]( ... Psi^[1]( W^[1](k-1) X(k) ) ... ) ) ),

where:
- Y_hat(k) denotes the vector output signal of the corresponding dimension;
- X(k) denotes the vector input signal of the corresponding dimension;
- Psi^[l] are diagonal matrices of activation functions of each layer;
- W^[l](k-1) are matrices of synaptic weights that are adjusted during the learning process based on error backpropagation;
- l = 1, 2, ..., L;
- k = 1, 2, ... is the discrete time index.

In the DNN family, the most popular are the convolutional neural networks (CNNs) [20, 22-25], which are mainly designed to process images represented in the form of (n1 x n2)-matrices X(k) = {x_{i1 i2}(k)} (where i1 = 1, 2, ..., n1 and i2 = 1, 2, ..., n2). These matrices must be vectorized before submission to the network, i.e. they must be presented in the form of vectors [10] whose dimension can be quite large, which leads to the effect of the "curse of dimensionality". This effect can be avoided by processing the original matrix using convolution, pooling and encoding operations, so that a vector of dimension smaller than (n1 n2 x 1) is fed to the perceptron's input.

Although DNNs provide high quality of information processing, their training time is too long, and the training process itself may require considerable computing resources. It is possible, however, to speed up the information processing by bypassing the operations of vectorization-devectorization, i.e. by storing the information to be processed not in the form of a vector but in the form of a matrix. This problem is solved by matrix neural networks [5, 6, 11, 12], which are, however, quite complex from the computational point of view. In this connection, it seems expedient to develop an architecture and tuning algorithms for a deep matrix neural network that is characterized by simplicity of numerical realization and high speed of learning of its synaptic weights.

II. ADAPTIVE BILINEAR MODEL

The proposed matrix DNN is based on the adaptive matrix bilinear model introduced earlier by the authors [7, 8]:

  Y_hat(k) = {y_hat_{j1 j2}(k)} = A(k-1) X(k) B(k-1),   j1 = 1, 2, ..., n1; j2 = 1, 2, ..., n2,   (1)

where A(k-1) and B(k-1) are (n1 x n1)- and (n2 x n2)-matrices of tunable parameters that are adjusted during the online learning-identification process.
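As a quick illustrative sketch (not from the paper; small NumPy dimensions and a constant learning rate chosen for clarity), the bilinear model (1) and one step of the gradient adaptation procedure (2) can be written as:

```python
import numpy as np

n1, n2 = 4, 3
rng = np.random.default_rng(0)

# Tunable (n1 x n1) and (n2 x n2) parameter matrices of model (1),
# initialized near the identity for illustration
A = np.eye(n1) + 0.01 * rng.standard_normal((n1, n1))
B = np.eye(n2) + 0.01 * rng.standard_normal((n2, n2))

X = rng.standard_normal((n1, n2))   # matrix input X(k)
Y = rng.standard_normal((n1, n2))   # reference matrix signal Y(k)

# Forward pass of the bilinear model (1): Y_hat(k) = A(k-1) X(k) B(k-1)
Y_hat = A @ X @ B
err_before = np.linalg.norm(Y - Y_hat)

# One gradient adaptation step (2) with a small constant learning rate
eta_A = eta_B = 0.01
E = Y - A @ X @ B                    # E(k) = Y(k) - A(k-1) X(k) B(k-1)
A = A + eta_A * E @ B.T @ X.T        # A(k) = A(k-1) + eta_A E B^T X^T
E_A = Y - A @ X @ B                  # E_A(k) uses the already updated A(k)
B = B + eta_B * X.T @ A.T @ E_A      # B(k) = B(k-1) + eta_B X^T A^T E_A

err_after = np.linalg.norm(Y - A @ X @ B)
```

For a sufficiently small learning rate both sub-steps are descent steps on the squared residual, so the error norm decreases after the update.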
For this, either the gradient adaptation procedure

  A(k) = A(k-1) + eta_A(k) E(k) B^T(k-1) X^T(k),
  B(k) = B(k-1) + eta_B(k) X^T(k) A^T(k) E_A(k)   (2)

is used, or its version optimized with respect to speed [7], which can be written as

  A(k) = A(k-1) + Tr( E(k) B^T(k-1) X^T(k) X(k) B(k-1) E^T(k) )
         x ( Tr( E(k) B^T(k-1) X^T(k) X(k) B(k-1) B^T(k-1) X^T(k) X(k) B(k-1) E^T(k) ) )^(-1)
         x E(k) B^T(k-1) X^T(k),
                                                                                   (3)
  B(k) = B(k-1) + Tr( E_A^T(k) A(k) X(k) X^T(k) A^T(k) E_A(k) )
         x ( Tr( A(k) X(k) X^T(k) A^T(k) E_A(k) E_A^T(k) A(k) X(k) X^T(k) A^T(k) ) )^(-1)
         x X^T(k) A^T(k) E_A(k),

which is the matrix generalization of the Kaczmarz-Widrow-Hoff learning algorithm. Here eta_A(k), eta_B(k) are learning rate parameters,

  E(k)   = Y(k) - A(k-1) X(k) B(k-1),
  E_A(k) = Y(k) - A(k) X(k) B(k-1),

and Y(k) is the reference matrix signal.

The learning algorithm (3) can be given additional filtering properties if the learning rate parameters in (2) are calculated using the recurrence relations

  eta_A^(-1)(k) = r_A(k) = beta r_A(k-1) + Tr( E(k) B^T(k-1) X^T(k) X(k) B(k-1) B^T(k-1) X^T(k) X(k) B(k-1) E^T(k) ),

  eta_B^(-1)(k) = r_B(k) = beta r_B(k-1) + Tr( A(k) X(k) X^T(k) A^T(k) E_A(k) E_A^T(k) A(k) X(k) X^T(k) A^T(k) ),

where 0 <= beta <= 1 is a smoothing parameter [9].

On the basis of the model (1) it is easy to introduce into consideration its nonlinear modification

  Y_hat(k) = {y_hat_{j1 j2}(k)} = Psi( A(k-1) X(k) B(k-1) ) = Psi( U(k) ),   (4)

which is in fact the matrix generalization of the transformation realized by any layer of a multilayer perceptron. In (4), Psi denotes an (n1 x n2)-matrix of activation functions that acts elementwise on the matrix U(k) = {u_{j1 j2}(k)} of internal activation signals of the system.

In this case, the adjustment of the parameters of the nonlinear matrix model (4) can be realized on the basis of the modified delta-rule

  a_{j1 j2}(k) = a_{j1 j2}(k-1) + eta_A(k) e_{j1 j2}(k) psi'(u_{j1 j2}(k)) SUM_{i2=1}^{n2} b_{j1 j2}(k-1) x_{i1 i2}(k)
               = a_{j1 j2}(k-1) + eta_A(k) e_{j1 j2}(k) psi'(u_{j1 j2}(k)) x_hat_{i1}(k)
               = a_{j1 j2}(k-1) + eta_A(k) delta_{j1 j2}(k) x_hat_{i1}(k),
                                                                                   (5)
  b_{j1 j2}(k) = b_{j1 j2}(k-1) + eta_B(k) e_{A j1 j2}(k) psi'(u_{A j1 j2}(k)) SUM_{i1=1}^{n1} a_{j1 j2}(k-1) x_{i1 i2}(k)
               = b_{j1 j2}(k-1) + eta_B(k) e_{A j1 j2}(k) psi'(u_{A j1 j2}(k)) x_hat_{i2}(k)
               = b_{j1 j2}(k-1) + eta_B(k) delta_{A j1 j2}(k) x_hat_{i2}(k).

On the basis of (4) it is easy to introduce into consideration a multilayer matrix neural network that realizes the transformation

  Y_hat(k) = Psi^[L]( A^[L](k-1) Psi^[L-1]( A^[L-1](k-1) ( ... Psi^[1]( A^[1](k-1) X(k) B^[1](k-1) ) ... ) B^[L-1](k-1) ) B^[L](k-1) ).   (6)

Using the learning algorithm (5) and error backpropagation, it is possible to obtain the adaptive procedure for tuning all the parameters of the matrix DNN (6):

- for the output layer:

  a^[L]_{j1 j2}(k) = a^[L]_{j1 j2}(k-1) + eta_A(k) delta^[L]_{j1 j2}(k) o_hat^[L-1]_{i1}(k),
  b^[L]_{j1 j2}(k) = b^[L]_{j1 j2}(k-1) + eta_B(k) delta^[L]_{A j1 j2}(k) o_hat^[L-1]_{A i2}(k),

where

  delta^[L]_{j1 j2}(k) = psi'(u^[L]_{j1 j2}(k)) e_{j1 j2}(k),
  o_hat^[L-1]_{i1}(k) = SUM_{i2=1}^{n2} b^[L]_{j1 j2}(k-1) o^[L-1]_{i1 i2}(k),
  delta^[L]_{A j1 j2}(k) = psi'(u^[L]_{A j1 j2}(k)) e_{A j1 j2}(k),
  o_hat^[L-1]_{A i2}(k) = SUM_{i1=1}^{n1} a^[L]_{j1 j2}(k) o^[L-1]_{i1 i2}(k);

- for the l-th hidden layer, 1 < l < L:

  a^[l]_{j1 j2}(k) = a^[l]_{j1 j2}(k-1) + eta_A(k) delta^[l]_{j1 j2}(k) o_hat^[l-1]_{i1}(k),
  b^[l]_{j1 j2}(k) = b^[l]_{j1 j2}(k-1) + eta_B(k) delta^[l]_{A j1 j2}(k) o_hat^[l-1]_{A i2}(k),

where

  delta^[l]_{j1 j2}(k) = psi'(u^[l]_{j1 j2}(k)) SUM_{i1=1}^{n1} delta^[l+1]_{j1 j2}(k) a^[l+1]_{j1 j2}(k),
  o_hat^[l-1]_{i1}(k) = SUM_{i2=1}^{n2} b^[l]_{j1 j2}(k-1) o^[l-1]_{i1 i2}(k),
  delta^[l]_{A j1 j2}(k) = psi'(u^[l]_{A j1 j2}(k)) SUM_{i2=1}^{n2} delta^[l+1]_{A j1 j2}(k) b^[l+1]_{j1 j2}(k),
  o_hat^[l-1]_{A i2}(k) = SUM_{i1=1}^{n1} a^[l]_{j1 j2}(k) o^[l-1]_{i1 i2}(k);

- for the first hidden layer:

  a^[1]_{j1 j2}(k) = a^[1]_{j1 j2}(k-1) + eta_A(k) delta^[1]_{j1 j2}(k) o_hat^[0]_{i1}(k),
  b^[1]_{j1 j2}(k) = b^[1]_{j1 j2}(k-1) + eta_B(k) delta^[1]_{A j1 j2}(k) o_hat^[0]_{A i2}(k),

where

  delta^[1]_{j1 j2}(k) = psi'(u^[1]_{j1 j2}(k)) SUM_{i1=1}^{n1} delta^[2]_{j1 j2}(k) a^[2]_{j1 j2}(k),
  o_hat^[0]_{i1}(k) = SUM_{i2=1}^{n2} b^[1]_{j1 j2}(k-1) x_{i1 i2}(k),
  delta^[1]_{A j1 j2}(k) = psi'(u^[1]_{j1 j2}(k)) SUM_{i2=1}^{n2} delta^[2]_{A j1 j2}(k) b^[2]_{j1 j2}(k),
  o_hat^[0]_{A i2}(k) = SUM_{i1=1}^{n1} a^[1]_{j1 j2}(k) x_{i1 i2}(k).
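The forward pass of the layered transformation (6) can be sketched as follows (an illustrative NumPy sketch, not the authors' code; tanh activations and near-identity random initialization are assumptions made for the example):

```python
import numpy as np

def matrix_dnn_forward(X, A_list, B_list, psi=np.tanh):
    """Forward pass of the multilayer matrix network (6):
    O^[l] = psi(A^[l] O^[l-1] B^[l]), with O^[0] = X(k)."""
    O = X
    for A, B in zip(A_list, B_list):
        U = A @ O @ B   # matrix of internal activation signals U(k)
        O = psi(U)      # elementwise activation
    return O

# Tiny usage example with 28 x 28 layers, the size used in the experiments
n1 = n2 = 28
L = 3
rng = np.random.default_rng(1)
A_list = [np.eye(n1) + 0.05 * rng.standard_normal((n1, n1)) for _ in range(L)]
B_list = [np.eye(n2) + 0.05 * rng.standard_normal((n2, n2)) for _ in range(L)]
X = rng.standard_normal((n1, n2))
Y_hat = matrix_dnn_forward(X, A_list, B_list)
print(Y_hat.shape)  # -> (28, 28)
```

Note that every hidden representation stays an (n1 x n2)-matrix, which is exactly what lets the network avoid vectorization-devectorization.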
III. COMPUTATIONAL EXPERIMENTS

The efficiency of the proposed system and its learning methods was demonstrated on a classification task. A number of experiments was carried out on the MNIST dataset introduced by Yann LeCun and Corinna Cortes [26]. This dataset is widely used for training and testing in machine learning, particularly in classification tasks. It contains 60000 training observations and 10000 test observations. Each observation is an image of size 28x28 pixels representing a handwritten digit, so the dataset has 10 classes (the digits from 0 to 9). Some examples of the images from this dataset are presented in Fig. 1.

The elements of an image are pixel values from 0 to 255, where 0 means a white pixel (background) and 255 means a black pixel (foreground). These values were normalized before training. The inputs of the network were (n1 x n2)-matrices with n1 = n2 = 28, and every hidden layer also had the size n1 x n2 = 28 x 28.

The results of the computational experiments are presented in Table 1.

TABLE 1. EXPERIMENTAL RESULTS

Number of layers in the network | Error on test set, %
3                               | 25
5                               | 20
10                              | 18

Fig. 1. Examples of the images from the MNIST dataset.
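To make the weight-reduction claim concrete for this 28 x 28 setting: one matrix layer stores an (n1 x n1)-matrix A and an (n2 x n2)-matrix B, while a fully connected MLP layer on the vectorized (n1 n2 x 1) input with the same output dimension stores an (n1 n2 x n1 n2) weight matrix. The following arithmetic sketch is illustrative and is not taken from the paper:

```python
n1 = n2 = 28

# One layer of the matrix network: A is (n1 x n1), B is (n2 x n2)
matrix_layer_params = n1 * n1 + n2 * n2   # 784 + 784 = 1568

# One fully connected layer on the vectorized (n1*n2 x 1) input,
# keeping the same output dimension n1*n2
dense_layer_params = (n1 * n2) ** 2       # 784**2 = 614656

print(matrix_layer_params, dense_layer_params)        # -> 1568 614656
print(dense_layer_params // matrix_layer_params)      # -> 392
```

So, under these assumptions, each matrix layer needs roughly 392 times fewer adjustable weights than the corresponding dense layer.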
IV. CONCLUSION

In this paper the matrix deep neural network and its learning algorithm are proposed. They allow a significant reduction of the number of adjustable weights due to the rejection of the vectorization-devectorization operations on 2D input signals. One of the main advantages of the proposed system is that it also preserves the information shared between the rows and columns of its 2D inputs.

In comparison with traditional multilayer perceptrons, the considered DNN has increased speed, determined by the reduced number of adjustable parameters and the optimization of the learning algorithm, as well as simplicity of numerical implementation. The proposed system can be used to solve a wide range of machine learning tasks, particularly those connected with image processing, where the input signals are presented to the system in the form of a matrix.

REFERENCES

[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep Learning," Nature, vol. 521, pp. 436-444, 2015.
[3] J. Schmidhuber, "Deep Learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[5] P. Daniušis and P. Vaitkus, "Neural networks with matrix inputs," Informatica, vol. 19, no. 4, pp. 477-486, 2008.
[6] J. Gao, Y. Guo, and Z. Wang, "Matrix neural networks," in Proceedings of the 14th International Symposium on Neural Networks (ISNN), Part II, Sapporo, Japan, pp. 1-10, 2017.
[7] Ye. V. Bodyanskiy, I. P. Pliss, and V. A. Timofeev, "Discrete adaptive identification and extrapolation of two-dimensional fields," Pattern Recognition and Image Analysis, vol. 5, no. 3, pp. 410-416, 1995.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, N.J.: Prentice Hall, Inc., 1999.
[9] S. Vorobyov and Ye. Bodyanskiy, "On a non-parametric algorithm for smoothing parameter control in adaptive filtering," Engineering Simulation, vol. 16, pp. 314-320, 1999.
[10] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: A review," Neurocomputing, vol. 187, pp. 27-48, 2016.
[11] P. Stubberud, "A vector matrix real time backpropagation algorithm for recurrent neural networks that approximate multi-valued periodic functions," International Journal of Computational Intelligence and Applications, vol. 8, no. 4, pp. 395-411, 2009.
[12] M. Mohamadian, H. Afarideh, and F. Babapour, "New 2D Matrix-Based Neural Network for Image Processing Applications," IAENG International Journal of Computer Science, vol. 42, no. 3, pp. 265-274, 2015.
[13] K. Suzuki, Artificial Neural Networks: Architectures and Applications. NY: InTech, 2013.
[14] K. L. Du and M. Swamy, Neural Networks and Statistical Learning. London: Springer-Verlag, 2014.
[15] D. Graupe, Principles of Artificial Neural Networks (Advanced Series in Circuits and Systems). Singapore: World Scientific Publishing Co. Pte. Ltd., 2007.
[16] L. Rutkowski, Computational Intelligence: Methods and Techniques. Berlin-Heidelberg: Springer-Verlag, 2008.
[17] R. Kruse, C. Borgelt, F. Klawonn, C. Moewes, M. Steinbrecher, and P. Held, Computational Intelligence. Berlin: Springer, 2013.
[18] D. T. Pham and X. Liu, Neural Networks for Identification, Prediction and Control. London: Springer-Verlag, 1995.
[19] I. Arel, D. Rose, and T. Karnowski, "Deep machine learning - a new frontier in artificial intelligence research," IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, 2010.
[20] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning Convolutional Feature Hierarchies for Visual Recognition," in Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 1, pp. 1090-1098, 2010.
[21] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3642-3649, 2012.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), vol. 1, pp. 1097-1105, 2012.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[24] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 253-256, 2010.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[26] The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/