          The Hierarchical Learning Algorithm for Deep Neural Networks

                   Stanislaw Placzek¹ and Aleksander Placzek²

      ¹ University of Business in Wroclaw, Poland
        stanislaw.placzek@wp.pl
      ² Silesian University of Technology, Faculty of Automatic Control, Electronics and Computer Science; WASKO, Gliwice, Poland
        a.placzek@wasko.pl
                                                 Abstract
         In this article, the emphasis is put on a modern artificial neural network (ANN) struc-
     ture, known in the literature as a deep neural network. Such a network includes more
     than one hidden layer and comprises many standard modules with the ReLU nonlinear
     activation function. A learning algorithm includes two standard steps, forward and backward,
     and its effectiveness depends on the way the learning error is propagated back through all
     the layers to the first layer. Taking into account the dimensionalities of all the matrices and
     the nonlinear characteristic of the ReLU activation function, the problem is very challenging.
     In practical tasks, the internal-layer weight matrices of a neural network with ReLU activation
     functions contain many zero-valued weight coefficients. This phenomenon has a negative impact
     on the convergence of the learning algorithm. When analyzing and describing an
     ANN structure, one usually finds that its first parameter is the number of layers
     "L". By implementing a hierarchical structure in the learning algorithm, the ANN struc-
     ture is divided into sub-networks. Every sub-network is responsible for finding the optimal
     values of its weight coefficients using a local target function to minimize the learning error.
     The second, coordination level of the learning algorithm is responsible for coordinating the
     local solutions and finding the minimum of the global target function. In each iteration
     the coordinator sends coordination parameters to the first-level sub-networks.
     Using the input and teaching vectors, the local procedures work and find
     their weight coefficients. In the same step the feedback errors are calculated and sent to the
     coordinator. The process is repeated until the minimum of all the target functions
     is achieved.


1    Deep neural network structure

A deep neural network is built of topologically and logically uniform modules known as
layers. A network's structure includes an input layer, a number of hidden layers, and finally an
output layer. Usually, a network is built of layers 1 ÷ L of different or identical structure.
From a mathematical point of view, layer l can be described by a matrix of weight coefficients
W^l, an input vector X^{l−1}, a hidden vector U^l, an activation function ReLU(U) = max(0, U),
and an output vector X^l (Fig. 1).
    A deep neural network includes L − 1 hidden layers and one output layer, which uses a
different activation function. The output layer has to aggregate a set of partial features from
the previous layers to achieve the final result, that is, the output signal. Using Fig. 1, one can
define the target function:

    Φ = (1/2) · (X^L − Y)^T · (X^L − Y)                                                    (1)

Where:
X^L [1 ÷ N^L] - the output vector,




                            Figure 1: Deep Neural Network structure


N^L - the dimensionality of the output vector,
Y [1 : N^L] - the vector of teaching data.

For all the hidden layers and for the forward calculation, one can write:

    U^l = W^l · X^{l−1}                                                                    (2)

    X^l = F(U^l)                                                                           (3)

Where:
l = 1 ÷ L - the index of a layer in the deep neural network,
U^l - the internal vector of layer l,
F - the vector activation function,
W^l - the matrix of weight coefficients of layer l,
X^{l−1}, X^l - the input and output vectors of layer l, accordingly.
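As an illustration of formulas (2) and (3), a minimal NumPy sketch of the forward pass is given
below. The layer sizes and the random initialization are only assumptions made for this example,
and ReLU is used in every layer, including the output layer, for simplicity.

    import numpy as np

    def forward(weights, x0):
        """Forward pass: U^l = W^l X^(l-1) (2) and X^l = max(0, U^l) (3).
        Returns the internal vectors U^l and the layer outputs X^l."""
        x = x0
        us, xs = [], [x0]
        for W in weights:
            u = W @ x                  # formula (2)
            x = np.maximum(0.0, u)     # formula (3), ReLU activation
            us.append(u)
            xs.append(x)
        return us, xs

    # Example with assumed layer widths 10 -> 8 -> 5 -> 1
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(8, 10)), rng.normal(size=(5, 8)), rng.normal(size=(1, 5))]
    U, X = forward(weights, rng.normal(size=10))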

The process of selecting an activation function is an essential and difficult task. When building
standard networks, the sigmoid and tanh activation functions are usually used. A sigmoid func-
tion has two saturation regions in which it approaches, in an asymptotic way, the value of zero
or one. This characteristic has a negative impact on the derivative value and, at the same time,
on the algorithm's convergence. At present, a new activation function is used in deep neural
networks, the Rectified Linear Unit (ReLU), which is defined as follows:

    f = ReLU(u) = max(0, u)                                                                (4)

From a mathematical point of view, this function is not differentiable at u = 0. In computer
applications, this problem is solved by accepting the derivative value f′ = 0 for u = 0.
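The accepted convention for the derivative at u = 0 can be written down directly; the short
sketch below is only an illustration of this convention.

    import numpy as np

    def relu(u):
        """ReLU activation, formula (4)."""
        return np.maximum(0.0, u)

    def relu_derivative(u):
        """Heaviside-type derivative 1(U): 1 for u > 0 and, by convention, 0 for u <= 0."""
        return (u > 0).astype(float)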

1.1    Learning algorithm

In computer applications, the back propagation learning algorithm is the most popular one. The
learning error is propagated from the output layer, through all the hidden layers, back to the
input data. From (1), one can calculate the first derivatives for the output layer, denoting:

    ∂Φ/∂X^L = E^L = X^L − Y                                                                (5)

    ∂Φ/∂W^L = ∂Φ/∂X^L · ∂X^L/∂U^L · ∂U^L/∂W^L = E^L · ∂X^L/∂U^L · ∂U^L/∂W^L                (6)



Where:
E^L = [e^L_1, e^L_2, ..., e^L_{N^L}] - the vector of the output layer error.

    The derivative of the ReLU function can be written as follows:

    ∂X^L/∂U^L = max(0, U^L)′ = 1(U^L)                                                      (7)
The last part of equation (6) is calculated as:

    ∂U^L/∂W^L = X^{L−1}                                                                    (8)

    Finally, formula (6) can be written in the matrix form using the Hadamard product
notation ⊙:

    ∂Φ/∂W^L = {E^L ⊙ 1(U^L)} · (X^{L−1})^T                                                 (9)
    ∂Φ/∂X^{L−1} = ∂Φ/∂X^L · ∂X^L/∂U^L · ∂U^L/∂X^{L−1} = E^L · ∂X^L/∂U^L · ∂U^L/∂X^{L−1}    (10)

    Using the same notation as for the output layer, formula (10) can be written as:

    E^{L−1} = ∂Φ/∂X^{L−1} = E^L · ∂X^L/∂U^L · ∂U^L/∂X^{L−1}                                (11)

    In the matrix form:

    E^{L−1} = (W^L)^T · {1(U^L) ⊙ E^L}                                                     (12)
The output layer error E^L has to be propagated to the previous layer L − 1, and this process is
repeated up to the first hidden layer. Fig. 2 shows the full scheme of the back propagation
algorithm and how a layer's error is propagated back through the network. The final back
propagation formulas have a recurrent structure. The algorithm starts from the output layer
L and, going through all the hidden layers, reaches the first hidden layer:

    E^{l−1} = (W^l)^T · {1(U^l) ⊙ E^l}                                                     (13)

    ∂Φ/∂W^l = {E^l ⊙ 1(U^l)} · (X^{l−1})^T                                                 (14)
Where:
l = 1 : L.
According to Fig. 2, a layer lies between the input vector X^{l−1} and the output vector X^l. The
same boundary is used for the back propagation of the error from E^l to E^{l−1}. In a standard
neural network, a sigmoid activation function is used. The output error is propagated back to
the first layer, decreasing its value along the way, so that the algorithm can finally end up with
values close to zero. This property has a very negative impact on convergence. Therefore, it is
the main reason to use the ReLU activation function, especially in a deep neural network. Its
derivative is equal to 1 or to 0 - the Heaviside step function. Some of the derivatives (14) are
equal to 0, so the corresponding weight coefficients do not change in the current iteration
(formula 14). The same can be observed in the error back propagation formula (13).
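A compact sketch of the recurrent back propagation formulas (13) and (14) is given below. It
reuses the forward pass from the earlier sketch; the learning rate alpha and the plain gradient
step are assumptions made for the example.

    import numpy as np

    def backward(weights, us, xs, y, alpha=0.01):
        """Back propagation with the ReLU derivative 1(U).
        `us`, `xs` come from the forward pass, `y` is the teaching vector."""
        relu_prime = lambda u: (u > 0).astype(float)
        e = xs[-1] - y                                       # E^L, formula (5)
        for l in range(len(weights) - 1, -1, -1):
            grad = np.outer(e * relu_prime(us[l]), xs[l])    # formula (14)
            e = weights[l].T @ (relu_prime(us[l]) * e)       # formula (13), E^(l-1)
            weights[l] = weights[l] - alpha * grad           # gradient step
        return weights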
    Taking into account all the limitations mentioned above, a neural network learning algorithm
is decomposed, and a new coordination level is implemented. Two or more levels could be used
to improve the learning algorithm convergence.





             Figure 2: A scheme of an error back propagation through the layers

2     Learning algorithm decomposition

In a multi-layer ANN, the hidden layers and the output layer are sectioned off. Each such
smaller part will be described as a sub-network. Every sub-network has its own output vector
X^l, which is the input vector of the succeeding sub-network, and a local target function Φ^l,
where l = 1 ÷ L. Because of the specific organization of the ANN's hierarchy, there are many
sub-networks on the first level, and for each of them a local target function is defined:

    Φ = {Φ^1, Φ^2, ..., Φ^l, ..., Φ^{L−1}, Φ^L}                                            (15)
    These sets of local tasks have to be coordinated to achieve the global solution. The co-
ordinator, as an independent task, has its own target function. Taking everything into account,
this concept is the base on which one may build the new scheme of the ANN learning
algorithm's structure (Fig. 3). It is the hierarchical structure of both the neural network and
the learning algorithm. The two-level ANN learning algorithm can be described as a set of
procedures. The procedures on the first level are responsible for solving their local tasks and
calculating their parts of the weight coefficient matrices. The second-level procedure has to
coordinate all the local procedures (tasks) using its own target function. The third-level pro-
cedure calculates the learning parameters which are used by the second level. There is a vertical
decomposition and interaction between the procedures. Two types of information are sent
between the levels. From the second level to the first level, there is a downward transmission
of control signals:

    Γ = (Γ^1, Γ^2, ..., Γ^{L−1})                                                           (16)

Where:
Γ^l - the vector of data sent from the coordinator to the two neighbouring sub-networks l and l + 1,
l = 1 ÷ (L − 1) - the index of a coordination signal.

From the first level to the second level, two sets of feedback signals are sent (their dimensions
are sketched after the list):

    • forward feedback errors, which are generated by every sub-network when all sub-networks
      calculate their local target function values Φ^l:

          E_F = (E^1_F, E^2_F, ..., E^{L−1}_F)                                             (17)





              Figure 3: The hierarchical structure of the ANN learning algorithm


    • backward feedback errors, which are calculated by every sub-network in the back propa-
      gation procedure:

          E_B = (E^1_B, E^2_B, ..., E^{L−1}_B)                                             (18)
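To fix the dimensions of these signals, the coordination vectors Γ^l and the two feedback error
sets from (16)-(18) can be held as simple lists of NumPy vectors. The layer widths below are
assumptions made only for the example.

    import numpy as np

    # Assumed layer widths N^0..N^L of a network with L = 3 layers: 10 -> 8 -> 6 -> 1
    widths = [10, 8, 6, 1]
    L = len(widths) - 1
    rng = np.random.default_rng(0)

    # Coordination signals Gamma^1 .. Gamma^(L-1), one per hidden-layer output, formula (16)
    gammas = [rng.normal(size=widths[l]) for l in range(1, L)]

    # Forward and backward feedback errors E_F^l and E_B^l have the same
    # dimensions as the corresponding coordination signals, formulas (17)-(18)
    e_forward = [np.zeros(widths[l]) for l in range(1, L)]
    e_backward = [np.zeros(widths[l]) for l in range(1, L)]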


2.1    Levels of calculation complexity

The standard ANN learning algorithm is a non-linear minimization task without constraints.
To solve this task, iterative procedures are used. Using the most popular back propagation
algorithm, one has to choose a lot of control parameters. The algorithm is time-consuming
and its convergence is not fast. By dividing the primary algorithm into sub-network tasks,
the local target functions become simpler and can be used in different procedures. Additionally,
a new procedure is needed: the coordination procedure. In practice, however, the coordinator
does not have the ability to find all the parameters needed by the first-level procedures. To
solve this problem, a multi-level decision hierarchy is proposed [1]. The problem is solved
by an iterative algorithm on both the first and the second level. One can observe specific
dynamic processes. These processes are non-linear and use a lot of control parameters. During
the learning process these parameters are usually kept constant. Practice proves that this
solution is not optimal. To control the way the learning parameters change in the iteration
process, an additional level can be used - the adaptation level (Fig. 3). Thus, one can build
three levels as a minimum:

   • The local optimization procedures: the algorithm is defined directly as a minimization
     task without constraints.

   • The coordination procedure: this algorithm can also be defined directly as a minimization
     of a target function. Constraints may or may not exist.



   • The adaptation procedure: the task or procedure on this level should specify the values
     of the learning parameters not only for the coordinator level, but also for the first level.
     To solve this task, the procedure should use the dynamic characteristics of the learning
     process from all the levels.

In conclusion, one can state that the complexity of the problem increases from the first level
to the next one. The coordination and adaptation procedures need more time to solve their
own tasks.


3     Calculation algorithm structure

A deep neural network with an input layer, a set of hidden layers and an output layer is used
for further considerations. The standard ReLU activation function is used for all the hidden
layers. For the output layer, either a sigmoid or a ReLU activation function can be used. A
three-part learning algorithm is considered in order to decompose the standard learning
algorithm's structure into sub-network tasks, taking Fig. 3 into account.

3.1    Forward calculation

All sub-networks are independent because the coordinator sends the input and output vectors
Γ^l to all the layers. The sub-networks can therefore calculate all values in a parallel way (a
sketch of this computation is given after the list below):
    • For the first sub-network:

          Φ^1(W^1, X^0, Γ^1) = (1/2) · (X^1 − Γ^1)^T · (X^1 − Γ^1)                         (19)

      Other relations:

          X^1 = F(U^1)                                                                     (20)

          U^1 = W^1 · X^0                                                                  (21)

      Where:
      X^0, X^1 - the input and output vectors of the first layer, accordingly,
      F - the vector activation function, ReLU = max(0, U^1),
      U^1 - the internal vector.
    • For all the hidden layers l = 2 : (L − 1):

          Φ^l(W^l, Γ^{l−1}, Γ^l) = (1/2) · (X^l − Γ^l)^T · (X^l − Γ^l)                     (22)

          X^l = F(U^l)                                                                     (23)

          U^l = W^l · Γ^{l−1}                                                              (24)

      Where:
      Γ^{l−1}, X^l - the input and output vectors of a hidden layer, accordingly,
      F - the vector activation function, ReLU = max(0, U^l),
      U^l - the internal vector.




    • For the output layer:

          Φ^L(W^L, Γ^{L−1}, Y) = (1/2) · (X^L − Y)^T · (X^L − Y)                           (25)

          X^L = F(U^L)                                                                     (26)

          U^L = W^L · Γ^{L−1}                                                              (27)

      Where:
      X^L - the output vector,
      Y - the teaching vector (an epoch of P vectors included).
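A minimal sketch of the parallel forward step described by formulas (19)-(27) follows. The
coordination signals gammas[l], the vector shapes, and the use of ReLU in the output layer are
assumptions made for this example.

    import numpy as np

    def forward_subnetworks(weights, gammas, x0, y):
        """Forward calculation of every sub-network, formulas (19)-(27).
        Each sub-network takes either X^0 or a coordination signal as input,
        and either a coordination signal or Y as its local target."""
        relu = lambda u: np.maximum(0.0, u)
        L = len(weights)
        phis, xs, us = [], [], []
        for l in range(L):
            inp = x0 if l == 0 else gammas[l - 1]        # X^0 or Gamma^(l-1), (21), (24), (27)
            target = y if l == L - 1 else gammas[l]      # Gamma^l or Y
            u = weights[l] @ inp
            x = relu(u)                                  # formulas (20), (23), (26)
            phi = 0.5 * (x - target) @ (x - target)      # formulas (19), (22), (25)
            us.append(u); xs.append(x); phis.append(phi)
        return phis, xs, us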


3.2    Backward calculation

When all the sub-networks have finished the forward calculation process, the next step, the
backward calculation, can begin. The calculated errors will be sent to the coordinator. Modified
formulas from subsection 1.1 are used. All sub-networks, with their own target functions, can
calculate their backward errors in a parallel way (a sketch for a hidden sub-network is given
after the list below):

    • For the first sub-network, only the forward error is calculated (see Fig. 4):

          E^1_F = X^1                                                                      (28)

      From formula (19), the partial derivatives for the matrix of weight coefficients W^1 are:

          ∂Φ^1/∂W^1 = ∂Φ^1/∂X^1 · ∂X^1/∂U^1 · ∂U^1/∂W^1                                    (29)

          ∂Φ^1/∂X^1 = E^1 = X^1 − Γ^1                                                      (30)

      Formula (29) can be rewritten in the matrix form:

          ∂Φ^1/∂W^1 = {E^1 ⊙ 1(U^1)} · (X^0)^T                                             (31)

      Where:
      1(U^1) - the derivative of the ReLU activation function, ReLU′ = max(0, U^1)′ = 1(U^1).
      The first sub-network can calculate the new values of the weight coefficient matrix W^1:

          W^1(n + 1) = W^1(n) − α · ∂Φ^1/∂W^1                                              (32)

      Where:
      n - the current iteration number.



    • For all the hidden layers l = 2 : (L − 1), using formula (22), the partial derivatives for the
      matrix of weight coefficients W^l and the input vector from the coordinator Γ^{l−1} are:

          ∂Φ^l/∂Γ^{l−1} = ∂Φ^l/∂X^l · ∂X^l/∂U^l · ∂U^l/∂Γ^{l−1}                            (33)

          ∂Φ^l/∂X^l = E^l = X^l − Γ^l                                                      (34)

          ∂U^l/∂Γ^{l−1} = W^l                                                              (35)

      Formula (33) can be rewritten in the matrix form:

          ∂Φ^l/∂Γ^{l−1} = E^{l−1} = (W^l)^T · {1(U^l) ⊙ E^l}                               (36)

      Where:
      1(U^l) - the derivative of the ReLU activation function, ReLU′ = max(0, U^l)′ = 1(U^l),

          E^{l−1}_B = Γ^{l−1} − β · ∂Φ^l/∂Γ^{l−1} = Γ^{l−1} − β · E^{l−1}                  (37)

      The derivative with respect to the weight coefficients of matrix W^l is:

          ∂Φ^l/∂W^l = ∂Φ^l/∂X^l · ∂X^l/∂U^l · ∂U^l/∂W^l                                    (38)

          ∂U^l/∂W^l = Γ^{l−1}                                                              (39)

      Formula (38) can be rewritten in the matrix form:

          ∂Φ^l/∂W^l = {E^l ⊙ 1(U^l)} · (Γ^{l−1})^T                                         (40)

      A sub-network can calculate the new values of the weight coefficient matrix:

          W^l(n + 1) = W^l(n) − α · ∂Φ^l/∂W^l                                              (41)

      Where:
      n - the current iteration number.
    • For the output layer, using formula (25), the partial derivatives for the matrix of weight
      coefficients W^L and the input vector from the coordinator Γ^{L−1} are:

          ∂Φ^L/∂Γ^{L−1} = ∂Φ^L/∂X^L · ∂X^L/∂U^L · ∂U^L/∂Γ^{L−1}                            (42)

          ∂Φ^L/∂X^L = E^L = X^L − Y                                                        (43)




          ∂U^L/∂Γ^{L−1} = W^L                                                              (44)

      Formula (42) can be rewritten in the matrix form:

          ∂Φ^L/∂Γ^{L−1} = E^{L−1} = (W^L)^T · {1(U^L) ⊙ E^L}                               (45)

          E^{L−1}_B = Γ^{L−1} − β · ∂Φ^L/∂Γ^{L−1} = Γ^{L−1} − β · E^{L−1}                  (46)

      Applying the same procedure as above, the following set of formulas is written:

          ∂Φ^L/∂W^L = ∂Φ^L/∂X^L · ∂X^L/∂U^L · ∂U^L/∂W^L                                    (47)

          ∂U^L/∂W^L = Γ^{L−1}                                                              (48)

      Formula (47) in the matrix form:

          ∂Φ^L/∂W^L = {E^L ⊙ 1(U^L)} · (Γ^{L−1})^T                                         (49)

      The output sub-network can calculate the new values of the weight coefficient matrix:

          W^L(n + 1) = W^L(n) − α · ∂Φ^L/∂W^L                                              (50)

      Where:
      n - the current iteration number.
    When all the sub-networks have finished the calculation of their local target functions, the
forward E^l_F and backward E^l_B feedback information is sent to the coordinator. At this
moment all the sub-networks have modified their weight coefficient matrices W^l for l = 1 : L.
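A sketch of the backward step of a single hidden sub-network, formulas (34)-(41), is shown
below. The learning parameters alpha and beta are assumptions, and the vector shapes follow
the forward sketch given after subsection 3.1.

    import numpy as np

    def backward_hidden_subnetwork(W, u, x, gamma_prev, gamma, alpha=0.01, beta=0.01):
        """Backward calculation of one hidden sub-network l.
        Returns the backward feedback error E_B^(l-1) and the updated matrix W^l."""
        relu_prime = lambda v: (v > 0).astype(float)
        e = x - gamma                                        # E^l, formula (34)
        e_prev = W.T @ (relu_prime(u) * e)                   # E^(l-1), formula (36)
        e_b = gamma_prev - beta * e_prev                     # backward feedback, formula (37)
        grad_W = np.outer(e * relu_prime(u), gamma_prev)     # formula (40)
        W_new = W - alpha * grad_W                           # formula (41)
        return e_b, W_new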


4     Coordinator structure

In a hierarchical learning algorithm, the coordinator plays the main role. It is now time to decide
what kind of coordination principle will be chosen. This principle specifies various strategies
for the coordinator and determines the structure of the coordinator. In [1], three methods were
introduced in which the interaction can be performed:

    • Interaction Prediction. The coordination input may involve a prediction of the interface
      input.

    • Interaction Decoupling. Each first-level sub-system is introduced into the solution of
      its task and can treat the interface input as an additional free decision variable. Conse-
      quently, the sub-systems are completely decoupled.

    • Interaction Estimation. The coordinator specifies the ranges over which the interface
      inputs may vary.



For a deep neural network containing L layers, the coordinator prepares l = 1 ÷ (L − 1) coordi-
nation signals Γ^l. Each signal is treated as the teaching data vector of the local target function
Φ^l of sub-network l and as the input vector of sub-network l + 1, for l = 1 : (L − 1). Every
coordination signal is correlated with two feedback signals, E^l_F and E^l_B. These signals are
calculated by the first-level sub-networks and are sent to the coordinator. The coordinator uses
its own target function:

    Ψ = (1/2) · Σ_{i=1}^{L−1} [ (Γ^i − E^i_B)^T · (Γ^i − E^i_B) + (Γ^i − E^i_F)^T · (Γ^i − E^i_F) ]    (51)

To minimize this function, the gradient method is used:

    ∂Ψ/∂Γ^i = (Γ^i − E^i_B) + (Γ^i − E^i_F) = 2 · Γ^i − [E^i_B + E^i_F]                    (52)

    The new values of the coordination signals are calculated using the gradient method:

    Γ^i(n + 1) = Γ^i(n) − ρ · ∂Ψ/∂Γ^i                                                      (53)

    The new coordination signal values are sent to the first-level sub-networks and the entire
process is repeated.
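A sketch of the coordinator's gradient step, formulas (51)-(53), is given below; it assumes that
the coordination signals and the feedback errors are available as lists of NumPy vectors, as in
the earlier sketches.

    import numpy as np

    def coordinator_step(gammas, e_b, e_f, rho):
        """One coordinator iteration: target function Psi (51), its gradient (52)
        and the update of the coordination signals (53)."""
        psi = 0.0
        new_gammas = []
        for g, eb, ef in zip(gammas, e_b, e_f):
            psi += 0.5 * ((g - eb) @ (g - eb) + (g - ef) @ (g - ef))   # formula (51)
            grad = 2.0 * g - (eb + ef)                                 # formula (52)
            new_gammas.append(g - rho * grad)                          # formula (53)
        return new_gammas, psi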


5      Adaptation level

In formula (53) the learning parameter ρ can be constant or can change during the iteration
process. At the beginning, the learning process is very dynamic and large oscillations can be
seen in the first-level local target functions. Because parameter ρ has to be small, the iteration
process is stable. However, although the minimum of Ψ is reached asymptotically, it happens
very slowly. To improve the algorithm, the coordinator sends the target function value Ψ(n)
to the adaptation level. The following simple algorithm can be used:
If Ψ(n) > Ψ(n − 1) then ρ = k_d · ρ, where k_d = 0.95 ÷ 0.98 is the decreasing parameter,
If Ψ(n) < Ψ(n − 1) then ρ = k_i · ρ, where k_i = 1.02 ÷ 1.05 is the increasing parameter.
The new ρ is sent to the coordinator and used in the iteration process. To make the entire
learning process more stable, Ψ is usually averaged over an epoch:

    Ψ(n, p + 1) = (1/N_p) · Ψ(n, p) + ((N_p − 1)/N_p) · Σ_{i=1}^{N_p − 1} Ψ(n, i)          (54)

Where: N_p - the number of input vectors in an epoch, p - the actual index value.
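The adaptation rule for ρ, based on two consecutive values of Ψ, can be written down directly;
the default values of k_d and k_i below are taken from the ranges given above, and the function
name is a hypothetical placeholder.

    def adapt_rho(rho, psi_now, psi_prev, k_d=0.97, k_i=1.03):
        """Adaptation level: decrease rho when Psi grows, increase it when Psi falls."""
        if psi_now > psi_prev:
            return k_d * rho    # learning became unstable, slow down the coordinator
        if psi_now < psi_prev:
            return k_i * rho    # learning improves, speed up the coordinator
        return rho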


6      Numerical example and conclusion

In a life insurance company, the underwriting process plays the central role in risk control
and premium calculation. A deep neural network can be used to help insurance agents classify
an insurance applicant and calculate the first version of the premium. Therefore, a special
short questionnaire was prepared which includes only 10 main questions (Fig. 4). All the data
were divided into three subsets: the learning set includes 250 records, the verification set
includes 50 vectors, and the testing set includes 100 vectors.





                         Figure 4: An input data structure example




              Figure 5: Dynamic characteristic of the second target function Φ^2




           Figure 6: Dynamic characteristics of the two feedback signals E^2_B and E^2_F






                     Figure 7: Dynamic characteristics of the coordinator


    The dynamic characteristics of the second sub-network contain two phases. From the start
to about the 4000th iteration, the error decreases very fast. In the next phase, the learning
process is not optimal.
    In Fig. 6, the feedback signals do not have optimal characteristics. Probably the learning
coefficient ρ is too high and the coordinator tries to accelerate the learning process, with the
reverse result. The adaptation level should enforce its own strategy to stabilize the learning
process. Future work should focus on the coordinator and the adaptation level strategy. The
main question is: how to improve the characteristics of the learning process?


References
[1] M. D. Mesarovic, D. Macko, and Y. Takahara, Theory of Hierarchical Multilevel Systems, Academic
    Press, New York and London, 1970.
[2] Ch. M. Bishop, Pattern Recognition and Machine Learning, Springer Science + Business Media,
    LLC, 2006.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[4] Zeng-Guang Hou, Madan M. Gupta, Peter N. Nikiforuk, Min Tan, and Long Cheng, "A Recurrent
    Neural Network for Hierarchical Control of Interconnected Dynamic Systems", IEEE Transactions
    on Neural Networks, Vol. 18, No. 2, March 2007.
[5] Joarder Kamruzzaman and Rezaul Begg, Artificial Neural Networks in Finance and Manufacturing,
    Idea Group Publishing, Hershey, Pennsylvania, 2006.
[6] S. K. Zhou, H. Greenspan, and D. Shen, Deep Learning for Medical Image Analysis, Academic
    Press, 2017.
[7] M. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
[8] S. Placzek and B. Adhikari, "Analysis of Multilayer Neural Network with Direct Connection Cross-
    forward Connection", CS&P Conference 2013, The University of Warsaw, Warsaw, 2013.
[9] L. Rutkowski, Metody i techniki sztucznej inteligencji, Wydawnictwo Naukowe PWN, Warszawa,
    2006.



