=Paper=
{{Paper
|id=Vol-2533/invited2
|storemode=property
|title=Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning
|pdfUrl=https://ceur-ws.org/Vol-2533/invited2.pdf
|volume=Vol-2533
|authors=Yevgeniy Bodyanskiy,Anastasiia Deineko,Iryna Pliss,Valeriia Slepanska
|dblpUrl=https://dblp.org/rec/conf/dcsmart/BodyanskiyDPS19
}}
==Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning==
Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning

Yevgeniy Bodyanskiy1[0000-0001-5418-2143], Anastasiia Deineko2[0000-0002-3279-3135], Iryna Pliss1[0000-0001-7918-7362] and Valeriia Slepanska2[0000-0002-0465-8593]

1 Control Systems Research Laboratory, 2 Artificial Intelligence Department, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine
yevgeniy.bodyanskiy@nure.ua, anastasiya.deineko@gmail.com, iryna.pliss@nure.ua, valeriia.slepanskaia@gmail.com

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2019 DCSMart Workshop.

Abstract. The paper proposes an adaptive activation function (AdPReLU) for deep neural networks that generalizes the rectified unit family and allows its parameters to be tuned online during the learning process of the neural network. A learning algorithm for a formal neuron with this adaptive activation function is developed; it generalizes the delta rule, is based on error backpropagation, and tunes the parameters of the activation function simultaneously with the synaptic weights. The proposed tuning algorithm is optimized for increased operating speed. Computational experiments confirm the effectiveness of the approach under consideration.

Keywords: deep neural network, adaptive activation function, delta rule, synaptic weights, rectified linear unit, learning algorithm.

1 Introduction

At present, artificial neural networks are widely used for solving Data Science tasks due to their ability to tune parameters and architecture during information processing and their universal approximation abilities. These properties provide effective solutions to the tasks of pattern recognition (classification), time series processing (prediction), and emulation of complex nonlinear objects and processes (identification and adaptive control).

The most widely used architectures are multilayer perceptrons whose node neurons are usually Rosenblatt's elementary perceptrons with sigmoidal activation functions. Besides traditional $\sigma$-functions [1], the most widespread are tanh, SoftSign, Satlin [2, 3], polynomial activation functions of special type [4], and other squashing functions.

Deep neural networks (DNN) [5-8] were created on the basis of the classical multilayer perceptrons. This has increased the effectiveness of processing images, audio signals, and arbitrary time series, as well as of intelligent text analysis. However, there are significant computational problems connected with the so-called vanishing and exploding gradients, which are caused by the specific form of the sigmoidal activation functions.

Consequently, the so-called rectified unit family [9] is used in DNN as activation functions. Besides the rectified linear unit (ReLU) itself, this family includes the leaky rectified linear unit (LReLU), parametric rectified linear unit (PReLU), randomized leaky rectified linear unit (RReLU), noisy rectified linear unit (NReLU), and exponential linear unit (ELU) [7-12]. The functions listed above are piecewise linear functions with fixed parameters chosen from empirical considerations. Their advantage is that their derivatives do not vanish, so they overcome the vanishing gradient problem and allow the learning process to be accelerated.
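As a point of reference for the adaptive function introduced below, the following short Python sketch (not part of the paper; the function names are illustrative) shows the fixed-parameter members of this family, where the slope of the negative branch is set once by hand and never tuned:

```python
import numpy as np

def relu(u):
    """Standard ReLU: the output is zero for negative internal activation."""
    return np.where(u > 0, u, 0.0)

def leaky_relu(u, alpha=0.01):
    """LReLU/PReLU form: a fixed slope alpha for u <= 0, chosen from empirical considerations."""
    return np.where(u > 0, u, alpha * u)
```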
However, these functions do not satisfy the conditions of G. Cybenko's theorem [1], so to provide the required quality of approximation it is necessary to increase the number of hidden layers in the DNN. This increases the computational complexity of the DNN and slows down the learning process.

Accordingly, it is expedient to introduce an adaptive parametric rectified linear activation function (AdPReLU) within the rectified unit family, whose parameters can be tuned during the learning process in the same way as the usual synaptic weights of the neuron, optimizing the adopted learning criterion and improving the approximating properties of both the individual neuron and the neural network as a whole.

2 Architecture of Neuron with Adaptive Parametric Rectified Linear Activation Function

Rosenblatt's perceptron as a node of any neural network implements a nonlinear mapping of the form

$$\hat{y}_j(k) = \psi_j\Big(\theta_{j0} + \sum_{i=1}^{n} w_{ji} x_i(k)\Big) = \psi_j\Big(\sum_{i=0}^{n} w_{ji} x_i(k)\Big) = \psi_j\big(w_j^T x(k)\big) = \psi_j\big(u_j(k)\big)$$

where $\hat{y}_j(k)$ is the output signal of the $j$-th neuron of the network at the moment of discrete time $k = 1, 2, \ldots$; $x(k) = (1, x_1(k), \ldots, x_i(k), \ldots, x_n(k))^T \in R^{n+1}$ is the input vector signal; $\theta_{j0} \equiv w_{j0}$ is the bias signal; $w_j = (w_{j0}, w_{j1}, \ldots, w_{ji}, \ldots, w_{jn})^T \in R^{n+1}$ is the vector of synaptic weights adjusted in the learning process; $u_j(k)$ is the internal activation signal; $\psi_j(\cdot)$ is the activation function of the $j$-th neuron, usually chosen from empirical considerations before the learning and functioning of the neural network.

Thus, in Cybenko's theorem the $\sigma$-function is used:

$$\hat{y}_j(k) = \psi_j(u_j(k)) = \frac{1}{1 + \exp\big(-\gamma_j u_j(k)\big)} \qquad (1)$$

where $\gamma_j$ is a gain parameter that determines the form of this function. It should be noticed that the derivative of the sigmoidal function,

$$\psi_j'(u_j(k)) = \gamma_j \hat{y}_j(k)\big(1 - \hat{y}_j(k)\big),$$

is a bell-shaped function. Therefore, the closer the value of $\hat{y}_j(k)$ is to 0 or 1, the closer the value of the derivative is to 0, which gives rise to the vanishing gradient.

In general form, the rectified unit family can be written as

$$\psi_j(u_j(k)) = \begin{cases} u_j(k) & \text{if } u_j(k) > 0, \\ \alpha_j u_j(k) & \text{otherwise} \end{cases} \qquad (2)$$

where the parameter $\alpha_j$ is chosen from empirical considerations and stays constant during the learning process. In the standard ReLU $\alpha_j$ equals 0, so that $\psi_j(u_j(k)) = 0$ if $u_j(k) < 0$. This may lead to the learning process freezing for negative values of the internal activation signal.

The generalization of activation function (2) has the form

$$\psi_j(u_j(k)) = \begin{cases} \alpha_j^R u_j(k) & \text{if } u_j(k) > 0, \\ \alpha_j^L u_j(k) & \text{otherwise,} \end{cases} \qquad (3)$$

however, the problem of choosing the values of the parameters $\alpha_j^R$ and $\alpha_j^L$ arises. The solution is to introduce an additional procedure for tuning these parameters into the neuron's learning process. This makes the learning process more sophisticated and requires tuning $n+3$ parameters instead of the $n+1$ adjustable parameters contained in the vector $w_j$. In spite of that, an improvement of the approximating properties is provided, since (3) can take different forms, for example $\psi_j(u_j(k)) = u_j(k)$ when $\alpha_j^R = \alpha_j^L = 1$.

The scheme of the neuron with adaptive parametric rectified linear activation function, in which the parameters $w_j$, $\alpha_j^R$, $\alpha_j^L$ are tuned during the learning process, is shown in Fig. 1.

Fig. 1. Neuron with adaptive parametric rectified linear unit (AdPReLU).

In Fig. 1, $y_j(k)$ is the external reference signal and $e_j(k) = y_j(k) - \hat{y}_j(k) = y_j(k) - \psi_j(u_j(k))$ is the learning error.
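To make the architecture in Fig. 1 concrete, here is a minimal Python sketch of the forward pass of a single neuron with the AdPReLU activation (3). It is not the authors' implementation; the class name, the weight initialization, and the initial slope values are assumptions made for illustration only.

```python
import numpy as np

class AdPReLUNeuron:
    """Formal neuron with AdPReLU (3): psi(u) = a_R * u if u > 0, a_L * u otherwise."""

    def __init__(self, n_inputs, a_right=1.0, a_left=0.1, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # w[0] plays the role of the bias w_j0; the constant input 1 is prepended in forward().
        self.w = rng.normal(scale=0.1, size=n_inputs + 1)
        self.a_right = a_right   # slope alpha_j^R for positive internal activation
        self.a_left = a_left     # slope alpha_j^L for negative internal activation

    def forward(self, x):
        """Return the internal activation u_j(k) and the output y_hat_j(k) for input vector x."""
        x_ext = np.concatenate(([1.0], x))   # x(k) = (1, x_1(k), ..., x_n(k))^T
        u = self.w @ x_ext                   # u_j(k) = w_j^T x(k)
        a = self.a_right if u > 0 else self.a_left
        return u, a * u
```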
3 Learning Procedure

As the learning criterion, the standard quadratic function is used in the form

$$E_j(k) = \frac{1}{2} e_j^2(k) = \frac{1}{2}\big(y_j(k) - \psi_j(u_j(k))\big)^2 = \frac{1}{2}\Big(y_j(k) - \psi_j\Big(\sum_{i=0}^{n} w_{ji} x_i(k)\Big)\Big)^2.$$

Its minimization by a gradient procedure leads to the synaptic weight tuning algorithm that can be written in the form

$$\begin{aligned} w_{ji}(k) &= w_{ji}(k-1) - \eta(k)\frac{\partial E_j(k)}{\partial w_{ji}} = w_{ji}(k-1) - \eta(k) e_j(k)\frac{\partial e_j(k)}{\partial w_{ji}} = \\ &= w_{ji}(k-1) - \eta(k) e_j(k)\frac{\partial e_j(k)}{\partial u_j(k)}\frac{\partial u_j(k)}{\partial w_{ji}} = w_{ji}(k-1) + \eta(k) e_j(k)\psi_j'(u_j(k)) x_i(k) = \\ &= w_{ji}(k-1) + \eta(k)\delta_j(k) x_i(k) \end{aligned}$$

or in the vector form

$$w_j(k) = w_j(k-1) + \eta(k)\delta_j(k) x(k)$$

where $\eta(k)$ is a learning rate parameter and $\delta_j(k) = e_j(k)\psi_j'(u_j(k))$ is the $\delta$-error.

For the standard hyperbolic tangent activation function this can be written as

$$\frac{\partial \psi_j(u_j)}{\partial u_j} = \gamma_j\big(1 - \tanh^2(\gamma_j u_j)\big) = \gamma_j\,\mathrm{sech}^2(\gamma_j u_j) = \gamma_j\big(1 - \hat{y}_j^2\big),$$

$$w_j(k) = w_j(k-1) + \eta(k) e_j(k)\gamma_j\big(1 - \hat{y}_j^2(k)\big) x(k). \qquad (4)$$

Obviously, if $\hat{y}_j(k) \to \pm 1$, the "vanishing gradient" effect appears. To improve the convergence of algorithm (4), it was proposed in [13] to tune the gain parameter $\gamma_j$ by a similar gradient procedure (5); however, this also suffers from the "vanishing gradient" effect.

The learning of the neuron shown in Fig. 1 by the backpropagation procedure begins with the tuning of the parameters $\alpha_j^R$ and $\alpha_j^L$. To simplify the transformations, let us temporarily drop the R and L indexes. Then

$$\begin{aligned} \alpha_j(k) &= \alpha_j(k-1) - \eta_\alpha(k)\frac{\partial E_j(k)}{\partial \alpha_j} = \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j(k) = \\ &= \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) w_j^T(k-1) x(k)\big) w_j^T(k-1) x(k). \end{aligned} \qquad (6)$$

The parameter learning process (6) of the AdPReLU activation function can be optimized for increased operating speed. The following transformations can be carried out:

$$\alpha_j(k) = \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j(k),$$
$$\alpha_j(k) u_j(k) = \alpha_j(k-1) u_j(k) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j^2(k),$$
$$y_j(k) - \alpha_j(k) u_j(k) = y_j(k) - \alpha_j(k-1) u_j(k) - \eta_\alpha(k) e_j(k) u_j^2(k),$$
$$\tilde{e}_j(k) = e_j(k) - \eta_\alpha(k) e_j(k) u_j^2(k),$$
$$\tilde{e}_j^2(k) = e_j^2(k) - 2\eta_\alpha(k) e_j^2(k) u_j^2(k) + \eta_\alpha^2(k) e_j^2(k) u_j^4(k),$$
$$\frac{\partial \tilde{e}_j^2(k)}{\partial \eta_\alpha} = -2 e_j^2(k) u_j^2(k) + 2\eta_\alpha(k) e_j^2(k) u_j^4(k) = 0,$$

which shows that the optimal value of the learning rate parameter $\eta_\alpha(k)$ is determined by the expression

$$\eta_\alpha(k) = u_j^{-2}(k). \qquad (7)$$

Substituting (7) into (6) and returning to the R and L indexes, the following result is obtained:

$$\begin{cases} \alpha_j^R(k) = \alpha_j^R(k-1) + \big(y_j(k) - \alpha_j^R(k-1) u_j(k)\big) u_j^{-1}(k) & \text{if } u_j(k) > 0, \\ \alpha_j^L(k) = \alpha_j^L(k-1) + \big(y_j(k) - \alpha_j^L(k-1) u_j(k)\big) u_j^{-1}(k) & \text{otherwise.} \end{cases} \qquad (8)$$

After the parameters $\alpha_j^R$ and $\alpha_j^L$ have been tuned, it is possible to return to the learning of the synaptic weights $w_j$. In this case the learning criterion is based on the error $\tilde{e}_j(k)$, i.e.

$$\tilde{E}_j(k) = \frac{1}{2}\tilde{e}_j^2(k) = \frac{1}{2}\big(y_j(k) - \alpha_j(k) w_j^T x(k)\big)^2. \qquad (9)$$
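A minimal Python sketch of rule (8), i.e. the slope update with the optimal step (7), is given below. It assumes a single neuron; the function name and the small threshold used to avoid division by a near-zero activation are illustrative additions, not part of the paper.

```python
def update_alpha(a_right, a_left, y_ref, u, eps=1e-12):
    """One step of rule (8): retune the AdPReLU slope on the branch selected by the sign of u.

    y_ref -- external reference signal y_j(k)
    u     -- internal activation u_j(k) = w_j^T(k-1) x(k)
    eps   -- safeguard against division by a vanishing activation (implementation detail)
    """
    if abs(u) < eps:
        return a_right, a_left               # skip the update when u_j(k) is (almost) zero
    if u > 0:
        a_right = a_right + (y_ref - a_right * u) / u
    else:
        a_left = a_left + (y_ref - a_left * u) / u
    return a_right, a_left
```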
The gradient minimization of (9) with respect to $w_j$ leads to the procedure

$$\begin{aligned} w_j(k) &= w_j(k-1) - \eta(k)\nabla_{w_j}\tilde{E}_j(k) = w_j(k-1) + \eta(k)\tilde{e}_j(k)\alpha_j(k) x(k) = \\ &= w_j(k-1) + \eta(k)\big(y_j(k) - \alpha_j(k) w_j^T(k-1) x(k)\big)\alpha_j(k) x(k) = \\ &= w_j(k-1) + \eta(k)\big(y_j(k) - w_j^T(k-1)\tilde{x}(k)\big)\tilde{x}(k) \end{aligned} \qquad (10)$$

where $\tilde{x}(k) = \alpha_j(k) x(k)$.

It is easy to notice that algorithm (10) is essentially the learning procedure of the Adaline neuron [2], which means that it can be optimized with respect to operating speed. As a result, we obtain the optimized one-step Kaczmarz-Widrow-Hoff learning algorithm [14, 15] in the form

$$w_j(k) = w_j(k-1) + \frac{y_j(k) - w_j^T(k-1)\tilde{x}(k)}{\|\tilde{x}(k)\|^2}\,\tilde{x}(k) = w_j(k-1) + \tilde{e}_j(k)\tilde{x}^{+T}(k) \qquad (11)$$

where $(\cdot)^+$ denotes pseudoinversion.

To prevent the "exploding gradient", the regularized version of (11) can be considered:

$$w_j(k) = w_j(k-1) + \big(\tilde{x}(k)\tilde{x}^T(k) + \alpha I\big)^{-1}\tilde{e}_j(k)\tilde{x}(k)$$

where $\alpha > 0$ is a momentum (regularization) term. Using the matrix inversion lemma, we finally obtain the expression

$$w_j(k) = w_j(k-1) + \frac{\tilde{e}_j(k)\tilde{x}(k)}{\alpha + \|\tilde{x}(k)\|^2}, \qquad (12)$$

which coincides with the additive form of Kaczmarz's algorithm.

To provide additional filtering properties to the learning algorithm (12), the procedure [16-18]

$$w_j(k) = w_j(k-1) + \frac{\tilde{e}_j(k)\tilde{x}(k)}{r(k)}, \qquad r(k) = \beta r(k-1) + \|\tilde{x}(k)\|^2$$

(where $0 \le \beta \le 1$ is a forgetting factor) can be used, which coincides with algorithm (11) if $\beta = 0$. If $\beta = 1$, it coincides with the stochastic approximation algorithm of Goodwin-Ramadge-Caines [19], which provides convergence under stochastic disturbances and noise.

Consequently, the resulting synaptic weight learning procedure can be written as

$$w_j(k) = \begin{cases} w_j(k-1) + \dfrac{\big(y_j(k) - \alpha_j^R(k) w_j^T(k-1) x(k)\big)\alpha_j^R(k) x(k)}{r^R(k)}, \\ \quad r^R(k) = \beta r^R(k-1) + \big(\alpha_j^R(k)\big)^2\|x(k)\|^2 & \text{if } w_j^T(k-1) x(k) > 0, \\[4pt] w_j(k-1) + \dfrac{\big(y_j(k) - \alpha_j^L(k) w_j^T(k-1) x(k)\big)\alpha_j^L(k) x(k)}{r^L(k)}, \\ \quad r^L(k) = \beta r^L(k-1) + \big(\alpha_j^L(k)\big)^2\|x(k)\|^2 & \text{otherwise.} \end{cases} \qquad (13)$$

Algorithms (8) and (13) describe the learning process of the neuron with adaptive parametric rectified linear activation function in general.
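Putting rules (8) and (13) together, the following Python sketch shows one complete learning step of a single AdPReLU neuron. It is not the authors' code: the class name, the initial slope values, and in particular the forgetting-factor smoothing applied to the slope step (the paper specifies the $\beta$-smoothed denominator only for the weight update (13); with $\beta = 0$ the slope step below reduces to the optimal rate (7) of rule (8)) are assumptions of this sketch.

```python
import numpy as np

class AdPReLULearner:
    """Single-neuron AdPReLU trainer sketched after rules (8) and (13).

    The beta-smoothed step for the slopes (r_aR, r_aL) is an assumption of this sketch;
    with beta = 0 it coincides with the optimal one-step rate (7) used in rule (8).
    """

    def __init__(self, n_inputs, beta=0.9, eps=1e-12, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.w = rng.normal(scale=0.1, size=n_inputs + 1)   # weights, w[0] is the bias w_j0
        self.a_right, self.a_left = 1.0, 0.1                # initial AdPReLU slopes (illustrative)
        self.rR = self.rL = eps                             # denominators r^R(k), r^L(k) of (13)
        self.r_aR = self.r_aL = eps                         # smoothed denominators for the slopes
        self.beta = beta                                    # forgetting factor, 0 <= beta <= 1

    def step(self, x, y_ref):
        """One learning step on the sample (x(k), y_j(k)); returns the output estimate."""
        x_ext = np.concatenate(([1.0], x))
        u = self.w @ x_ext                                  # u_j(k) = w_j^T(k-1) x(k)
        sq_norm = x_ext @ x_ext
        e = y_ref - (self.a_right if u > 0 else self.a_left) * u   # learning error e_j(k)
        if u > 0:
            # slope update in the spirit of (8), with the beta-smoothed step instead of u^{-2}
            self.r_aR = self.beta * self.r_aR + u * u
            self.a_right += e * u / self.r_aR
            a = self.a_right
            # weight update by rule (13) with the refined error e~_j(k)
            e_tilde = y_ref - a * u
            self.rR = self.beta * self.rR + a * a * sq_norm
            self.w += e_tilde * a * x_ext / self.rR
        else:
            self.r_aL = self.beta * self.r_aL + u * u
            self.a_left += e * u / self.r_aL
            a = self.a_left
            e_tilde = y_ref - a * u
            self.rL = self.beta * self.rL + a * a * sq_norm
            self.w += e_tilde * a * x_ext / self.rL
        return a * u
```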
4 Computer Experiments

To demonstrate the efficiency of the proposed neuron and its learning procedure, a simulation test was implemented based on the approximation of a reference signal defined by the expression

$$y_j(k) = \tanh\big(0.1 x_1(k) + 0.2 x_2(k) + 0.3 x_3(k) + 0.4 x_4(k)\big) = \tanh\big(u_j(k)\big)$$

where $x_i(k)$ is a uniformly distributed random variable on the interval $-1 \le x_i(k) \le 1$. The results of the proposed approach were compared with the results obtained using the Adaline neuron, a neuron with the standard ReLU activation function, and a neuron with the classical $\tanh(u_j(k))$ activation function. The figures below show how the mean square error

$$\bar{e}_j^2(N) = \frac{1}{N}\sum_{k=1}^{N} e_j^2(k) = \bar{e}_j^2(N-1) + \frac{1}{N}\big(e_j^2(N) - \bar{e}_j^2(N-1)\big)$$

changes during learning. In this experiment the best results were obtained by the neuron with the AdPReLU activation function: it surpasses the Adaline neuron, the neuron with ReLU, and the one with $\tanh(u_j(k))$. When the reference signals

$$y_j(k) = \sin\big(0.5\pi u_j(k)\big), \qquad y_j(k) = \begin{cases} \tanh u_j(k) & \text{if } u_j(k) > 0, \\ u_j^3(k) & \text{otherwise,} \end{cases} \qquad y_j(k) = \tanh\big(u_j(k)\big)$$

were chosen, the proposed neuron also outperforms Adaline, ReLU, and $\tanh(u_j(k))$.

Fig. 2. Convergence curves for training neurons with the reference signal $y_j(k) = \sin(0.5\pi u_j(k))$.

Fig. 3. Convergence curves for training neurons with the reference signal $y_j(k) = \tanh u_j(k)$ if $u_j(k) > 0$, $u_j^3(k)$ otherwise.

Fig. 4. Convergence curves for training neurons with the reference signal $y = \tanh(u_j(k))$.
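As an illustration of how the first experiment could be reproduced, the sketch below reuses the AdPReLULearner class sketched at the end of Section 3, generates the uniformly distributed inputs, and tracks the recursive mean square error defined above. The number of samples and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np

# Reproduction sketch of the first experiment: y_j(k) = tanh(0.1*x1 + 0.2*x2 + 0.3*x3 + 0.4*x4).
rng = np.random.default_rng(0)
learner = AdPReLULearner(n_inputs=4, beta=0.9, rng=rng)   # class sketched after Section 3

weights_true = np.array([0.1, 0.2, 0.3, 0.4])
mse, n = 0.0, 0
for k in range(1, 2001):
    x = rng.uniform(-1.0, 1.0, size=4)            # x_i(k) uniformly distributed on [-1, 1]
    y_ref = np.tanh(weights_true @ x)             # reference signal y_j(k)
    y_hat = learner.step(x, y_ref)
    n += 1
    mse += ((y_ref - y_hat) ** 2 - mse) / n       # recursive mean square error, as in Section 4
print(f"running MSE after {n} samples: {mse:.6f}")
```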
5 Conclusion

In this paper, a formal neuron with an adaptive activation function, whose parameters are tuned simultaneously with the synaptic weights, is introduced. The proposed activation function is a generalization of the rectified unit family and provides an improvement of the approximating properties. The use of AdPReLU in deep neural networks protects the learning process from the "vanishing and exploding gradients". In addition, the proposed tuning algorithms are optimized for increased operating speed, i.e. they significantly reduce the learning time of the network as a whole. Computational experiments confirm the effectiveness of the proposed approach.

References

1. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Contr. Sign. Syst., vol. 2, pp. 303-314 (1989).
2. Cichocki, A., Unbehauen, R.: Neural Networks for Optimization and Signal Processing. Teubner, Stuttgart (1993).
3. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks, vol. 4, pp. 251-257 (1991).
4. Bodyanskiy, Ye.V., Kulishova, N.Ye., Rudenko, O.G.: One model of formal neuron. Reports of National Academy of Sciences of Ukraine, vol. 4, pp. 69-73 (2001).
5. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature, vol. 521, pp. 436-444 (2015).
6. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks, vol. 61, pp. 82-117 (2015).
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016).
8. Graupe, D.: Deep Learning Neural Networks: Design and Case Studies. World Scientific, New Jersey (2016).
9. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015).
10. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proc. IEEE Int. Conf. on Computer Vision, pp. 1026-1034, arXiv preprint arXiv:1502.01852 (2015).
11. Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015).
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 770-778 (2016).
13. Kruschke, J.K., Movellan, J.R.: Benefits of gain: speeded learning and minimal layers backpropagation networks. IEEE Trans. on Syst., Man, and Cybern., vol. 21, pp. 273-280 (1991).
14. Kaczmarz, S.: Approximate solution of systems of linear equations. Int. J. Control, vol. 53, pp. 1269-1271 (1993).
15. Widrow, B., Hoff, M.E., Jr.: Adaptive switching circuits. IRE WESCON Convention Record, Part 4, pp. 96-104 (1960).
16. Bodyanskiy, Ye.V., Pliss, I.P., Solovyova, T.V.: Multistep optimal predictors of multivariable non-stationary stochastic processes. Reports of Academy of Sciences of USSR, vol. 12, pp. 47-49 (1986).
17. Bodyanskiy, Ye., Kolodyazhniy, V., Stephan, A.: An adaptive learning algorithm for a neuro-fuzzy network. In: Reusch, B. (ed.) Computational Intelligence. Theory and Applications, pp. 68-75. Springer-Verlag, Berlin Heidelberg (2001).
18. Otto, P., Bodyanskiy, Ye., Kolodyazhniy, V.: A new learning algorithm for a forecasting neuro-fuzzy network. Integrated Computer-Aided Engineering, vol. 10, no. 4, pp. 399-409 (2003).
19. Goodwin, G.C., Ramadge, P.J., Caines, P.E.: A globally convergent adaptive predictor. Automatica, vol. 17, pp. 135-140 (1981).