<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>niy Bo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>nskiy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence department, Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Control systems research laboratory</institution>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>The paper proposes an adaptive activation function (AdPReLU) for deep neural networks, which generalizes the rectified unit family and allows its parameters to be tuned online during the learning process of the neural network. A learning algorithm for a formal neuron with this adaptive activation function is developed; it generalizes the delta rule and, based on error backpropagation, tunes the parameters of the activation function simultaneously with the synaptic weights. The proposed tuning algorithm is optimized to increase operating speed. Computational experiments confirm the effectiveness of the approach under consideration.</p>
      </abstract>
      <kwd-group>
        <kwd>deep neural network</kwd>
        <kwd>adaptive activation function</kwd>
        <kwd>delta-rule</kwd>
        <kwd>synaptic weights</kwd>
        <kwd>rectified linear unit</kwd>
        <kwd>learning algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>At the present time, artificial neural networks are widely used for solving Data Science
tasks due to their ability to tune parameters and architecture during information
processing and their universal approximation abilities. These properties enable the
effective solution of the tasks of pattern recognition (classification), time series
processing (prediction), and the emulation of complex nonlinear objects and processes
(identification and adaptive control).</p>
      <p>The most widely used are multilayer perceptrons whose node neurons usually are
Rosenblatt's elementary perceptrons with sigmoidal activation functions. Besides the
traditional σ-functions [1], the most widespread are tanh, SoftSign, Satlin [2, 3],
polynomial activation functions of special type [4], and other squashing functions.</p>
      <p>Deep neural networks (DNN) were created on the basis of the classical multilayer
perceptrons [5-8]. This has increased the effectiveness of image processing, audio signal
processing, arbitrary time series processing, and intelligent text analysis. However, there are
significant computational problems connected with the so-called vanishing and
exploding gradients, which arise from the specific form of sigmoidal activation functions.</p>
      <p>Consequently, the so-called rectified unit family [9] is used in DNNs as activation
functions. Besides the rectified linear unit (ReLU) itself, such functions can be noted as
the leaky rectified linear unit (LReLU), parametric rectified linear unit (PReLU),
randomized leaky rectified linear unit (RReLU), noisy rectified linear unit (NReLU),
and exponential linear unit (ELU) [7-12].</p>
      <p>The functions listed above are piecewise linear functions with fixed parameters
chosen by empirical considerations. Their advantage is that their derivatives do not
vanish, so they overcome the vanishing gradient problem and speed up the learning
process. However, these functions do not satisfy the conditions of G. Cybenko's
theorem [1], so to provide the required quality of approximation it is necessary to
increase the number of hidden layers in the DNN. This increases the DNN's
computational complexity and decreases the learning speed.</p>
      <p>Accordingly, it is expedient to introduce an adaptive parametric rectified linear
activation function (AdPReLU) within the rectified unit family, whose parameters can
be tuned during the learning process like the usual neuron's synaptic weights,
optimizing the adopted learning criterion and improving the approximating properties
of both the individual neuron and the neural network in general.</p>
    </sec>
    <sec id="sec-1a">
      <title>Architecture of Neuron with Adaptive Parametric Rectified Linear Activation Function</title>
      <p>Rosenblatt's perceptron as a node of any neural network implements the nonlinear mapping</p>
      <p>ŷ_j(k) = ψ_j(θ_{j0} + Σ_{i=1}^n w_{ji} x_i(k)) = ψ_j(Σ_{i=0}^n w_{ji} x_i(k)) = ψ_j(w_j^T x(k)) = ψ_j(u_j(k)),</p>
      <p>where ŷ_j(k) is the output signal of the j-th neuron of the network at the moment of discrete
time k = 1, 2, …; x(k) = (1, x_1(k), …, x_i(k), …, x_n(k))^T ∈ R^{n+1} is the input vector signal;
θ_{j0} ≡ w_{j0} is the bias signal; w_j = (w_{j0}, w_{j1}, …, w_{ji}, …, w_{jn})^T ∈ R^{n+1} is the vector of
synaptic weights adjusted in the learning process; u_j(k) is the internal activation signal; and ψ_j(·) is the
activation function of the j-th neuron, usually chosen by empirical considerations during
the design and operation of the neural network.</p>
      <p>Thus, in Cybenko's theorem the σ-function is used:</p>
      <p>
        ŷ_j(k) = ψ_j(u_j(k)) = 1/(1 + exp(-γ_j u_j(k))),
        (<xref ref-type="bibr" rid="ref1">1</xref>)
        where γ_j is a gain parameter which determines the shape of this function.
It should be noticed that the derivative of the sigmoidal function has the form
ψ'_j(u_j(k)) = γ_j ŷ_j(k)(1 - ŷ_j(k)),
i.e. it is a bell-shaped function. Therefore, the closer the value of
ŷ_j(k) is to 0 or 1, the closer the value of the derivative is to 0, which gives rise to the
vanishing gradient.
      </p>
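      <p>The vanishing of the sigmoidal derivative near saturation can be illustrated numerically; the following is a small sketch (function and variable names are illustrative, and the gain γ_j = 1 is an assumed value):</p>
      <preformat>
```python
import numpy as np

def sigmoid(u, gamma=1.0):
    # Logistic activation (1) with gain parameter gamma
    return 1.0 / (1.0 + np.exp(-gamma * u))

def sigmoid_derivative(u, gamma=1.0):
    # Derivative expressed through the output: gamma * y * (1 - y)
    y = sigmoid(u, gamma)
    return gamma * y * (1.0 - y)

u = np.array([0.0, 2.0, 5.0, 10.0])
# The derivative peaks at u = 0 and shrinks toward 0 as |u| grows,
# which is exactly the vanishing-gradient behaviour described above.
print(sigmoid_derivative(u))
```
      </preformat>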
      <p>In general form, the rectified unit family can be written as</p>
      <p>ψ_j(u_j(k)) = u_j(k) if u_j(k) &gt; 0, and ψ_j(u_j(k)) = α_j u_j(k) otherwise, (2)</p>
      <p>where the parameter α_j is chosen by empirical considerations and stays constant
during the learning process. In the standard ReLU α_j equals 0, so
ψ_j(u_j(k)) = 0 if u_j(k) &lt; 0.</p>
      <p>This may lead to the learning process being frozen for negative values of the
internal activation signal.</p>
      <p>
        The generalization of activation function (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) has the form
      </p>
      <p>
        ψ_j(u_j(k)) = α_j^R u_j(k) if u_j(k) &gt; 0, and ψ_j(u_j(k)) = α_j^L u_j(k) otherwise,
        (<xref ref-type="bibr" rid="ref3">3</xref>)
        however, there is the problem of choosing the values of the α_j^R and α_j^L parameters.
The solution is to add an extra procedure for tuning these parameters to the
neuron's learning process. This makes the learning process more sophisticated and leads
to the necessity of tuning n+3 parameters instead of the n+1 adjustable parameters
contained in the vector w_j. Nevertheless, an improvement of the approximating properties is
provided, since (<xref ref-type="bibr" rid="ref3">3</xref>) can take different forms, for example
ψ_j(u_j(k)) = |u_j(k)| when α_j^R = 1 and α_j^L = -1.
      </p>
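      <p>As a minimal sketch (function and parameter names are illustrative), the AdPReLU mapping above can be written as:</p>
      <preformat>
```python
import numpy as np

def adprelu(u, alpha_r, alpha_l):
    # AdPReLU (3): slope alpha_r on the positive side, alpha_l otherwise;
    # both slopes are adjustable, unlike the fixed slopes of (P)ReLU.
    return np.where(u > 0, alpha_r * u, alpha_l * u)

u = np.array([-2.0, -0.5, 0.5, 2.0])
print(adprelu(u, 1.0, -1.0))   # alpha_r = 1, alpha_l = -1 reproduces |u|
print(adprelu(u, 1.0, 0.0))    # alpha_l = 0 reproduces the standard ReLU
```
      </preformat>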
      <p>Fig. 1 shows the scheme of the neuron with adaptive parametric rectified linear activation
function, in which the parameters w_j, α_j^R, α_j^L are tuned during the learning
process.</p>
      <p>In Fig. 1, y_j(k) is the external reference signal and
e_j(k) = y_j(k) - ŷ_j(k) = y_j(k) - ψ_j(u_j(k)) is the learning error.</p>
    </sec>
    <sec id="sec-2">
      <title>Learning Procedure</title>
      <p>As the learning criterion, the standard quadratic function is used in the form</p>
      <p>E_j(k) = (1/2)e_j^2(k) = (1/2)(y_j(k) - ψ_j(u_j(k)))^2 = (1/2)(y_j(k) - ψ_j(Σ_{i=0}^n w_{ji} x_i(k)))^2.</p>
      <p>Its minimization by a gradient procedure leads to the algorithm of synaptic weights
tuning that can be written in the form</p>
      <p>w_{ji}(k) = w_{ji}(k-1) - η(k)∂E_j(k)/∂w_{ji} = w_{ji}(k-1) - η(k)e_j(k)∂e_j(k)/∂w_{ji} = w_{ji}(k-1) - η(k)e_j(k)(∂e_j(k)/∂u_j(k))(∂u_j(k)/∂w_{ji}) = w_{ji}(k-1) + η(k)e_j(k)ψ'_j(u_j(k))x_i(k) = w_{ji}(k-1) + η(k)δ_j(k)x_i(k),</p>
      <p>or in the vector form:</p>
      <p>w_j(k) = w_j(k-1) + η(k)δ_j(k)x(k), (4)</p>
      <p>where η(k) is a learning rate parameter and δ_j(k) = e_j(k)ψ'_j(u_j(k)) is the δ-error.</p>
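      <p>One step of the delta rule above can be sketched as follows (the names and the constant learning rate are illustrative assumptions):</p>
      <preformat>
```python
import numpy as np

def delta_rule_step(w, x, y_ref, psi, psi_prime, eta=0.1):
    # One step of (4): w(k) = w(k-1) + eta * delta * x,
    # with delta = e * psi'(u) and e = y_ref - psi(u).
    u = w @ x
    e = y_ref - psi(u)
    delta = e * psi_prime(u)
    return w + eta * delta * x

# Example with the tanh activation; x[0] = 1 plays the role of the bias input
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
w = delta_rule_step(w, x, 0.3, np.tanh, lambda u: 1 - np.tanh(u)**2)
```
      </preformat>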
      <p>For the standard hyperbolic tangent activation function it can be written as</p>
      <p>∂ψ_j(u_j)/∂u_j = γ_j(1 - tanh^2(γ_j u_j)) = γ_j sech^2(γ_j u_j) = γ_j(1 - ŷ_j^2),</p>
      <p>w_j(k) = w_j(k-1) + η(k)e_j(k)γ_j(1 - ŷ_j^2(k))x(k).</p>
      <p>
        Obviously, if ŷ_j(k) → ±1, the «vanishing gradient» effect appears. To improve the
convergence of algorithm (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), in [13] it was proposed to tune the gain parameter γ_j during learning;
however, that procedure also suffers from the «vanishing gradient» effect.
      </p>
      <p>The learning of the neuron shown in Fig. 1 with the backpropagation procedure
begins with the tuning of the α_j^R and α_j^L parameters. To simplify the transformations,
let us temporarily omit the R and L indices.</p>
      <p>
        Then
        α_j(k) = α_j(k-1) - η_α(k)∂E_j(k)/∂α_j = α_j(k-1) + η_α(k)(y_j(k) - α_j(k-1)w_j^T(k-1)x(k))w_j^T(k-1)x(k).
        (<xref ref-type="bibr" rid="ref6">6</xref>)
      </p>
      <p>
        The parameter learning process of the AdPReLU activation function (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) can be optimized to increase the operating speed using the following transformations:
      </p>
      <p>α_j(k) = α_j(k-1) + η_α(k)(y_j(k) - α_j(k-1)u_j(k))u_j(k),</p>
      <p>α_j(k)u_j(k) = α_j(k-1)u_j(k) + η_α(k)(y_j(k) - α_j(k-1)u_j(k))u_j^2(k),</p>
      <p>y_j(k) - α_j(k)u_j(k) = y_j(k) - α_j(k-1)u_j(k) - η_α(k)e_j(k)u_j^2(k),</p>
      <p>ẽ_j(k) = e_j(k) - η_α(k)e_j(k)u_j^2(k),</p>
      <p>ẽ_j^2(k) = e_j^2(k) - 2η_α(k)e_j^2(k)u_j^2(k) + η_α^2(k)e_j^2(k)u_j^4(k),</p>
      <p>∂ẽ_j^2(k)/∂η_α = -2e_j^2(k)u_j^2(k) + 2η_α(k)e_j^2(k)u_j^4(k) = 0,</p>
      <p>
        which suggests that the optimal value of the learning rate parameter η_α(k) is determined by the expression
        η_α(k) = u_j^{-2}(k).
        (<xref ref-type="bibr" rid="ref7">7</xref>)
      </p>
      <p>
        Then, by substituting (<xref ref-type="bibr" rid="ref7">7</xref>) into (<xref ref-type="bibr" rid="ref6">6</xref>) and returning to the R and L indices, the following result is obtained:
        α_j^R(k) = α_j^R(k-1) + (y_j(k) - α_j^R(k-1)u_j(k))u_j^{-1}(k) if u_j(k) &gt; 0,
        α_j^L(k) = α_j^L(k-1) + (y_j(k) - α_j^L(k-1)u_j(k))u_j^{-1}(k) otherwise.
        (<xref ref-type="bibr" rid="ref8">8</xref>)
      </p>
      <p>
        After the α_j^R and α_j^L parameters are tuned, we can return to the learning of the synaptic weights w_j. In this case the learning criterion is based on the error ẽ_j(k), i.e.
        Ẽ_j(k) = (1/2)ẽ_j^2(k) = (1/2)(y_j(k) - α_j(k)w_j^T x(k))^2.
        (<xref ref-type="bibr" rid="ref9">9</xref>)
      </p>
      <p>
        The gradient minimization of (<xref ref-type="bibr" rid="ref9">9</xref>) with respect to w_j leads to the procedure
        w_j(k) = w_j(k-1) - η(k)∇_{w_j}Ẽ_j(k) = w_j(k-1) + η(k)ẽ_j(k)α_j(k)x(k) = w_j(k-1) + η(k)(y_j(k) - α_j(k)w_j^T(k-1)x(k))α_j(k)x(k) = w_j(k-1) + η(k)(y_j(k) - w_j^T(k-1)x̃(k))x̃(k),
        (<xref ref-type="bibr" rid="ref10">10</xref>)
        where x̃(k) = α_j(k)x(k).
      </p>
      <p>
        It is simple to notice that algorithm (<xref ref-type="bibr" rid="ref10">10</xref>) is essentially the learning procedure of the Adaline neuron [2], which means that it can be optimized with respect to operating speed. As a result, we obtain the optimized one-step Kaczmarz-Widrow-Hoff learning algorithm [14, 15] in the form
        w_j(k) = w_j(k-1) + (y_j(k) - w_j^T(k-1)x̃(k))x̃(k)/‖x̃(k)‖^2 = w_j(k-1) + ẽ_j(k)(x̃^+(k))^T,
        (<xref ref-type="bibr" rid="ref11">11</xref>)
        where (·)^+ is the symbol of pseudoinversion.
      </p>
      <p>
        To prevent the “exploding gradient”, a regularized version of (<xref ref-type="bibr" rid="ref11">11</xref>) can be considered:
        w_j(k) = w_j(k-1) + (x̃(k)x̃^T(k) + αI)^{-1}ẽ_j(k)x̃(k),
        (<xref ref-type="bibr" rid="ref12">12</xref>)
        where α &gt; 0 is a regularization parameter. Using the matrix inversion lemma, we finally obtain the expression
        w_j(k) = w_j(k-1) + ẽ_j(k)x̃(k)/(α + ‖x̃(k)‖^2),
        which coincides with the additive form of Kaczmarz's algorithm.
      </p>
      <p>
        To provide additional filtering properties to learning algorithm (<xref ref-type="bibr" rid="ref12">12</xref>), the procedure [16-18]
        w_j(k) = w_j(k-1) + ẽ_j(k)x̃(k)/r(k), r(k) = βr(k-1) + ‖x̃(k)‖^2
        (where 0 ≤ β ≤ 1 is a forgetting factor) can be used, which coincides with algorithm (<xref ref-type="bibr" rid="ref11">11</xref>) if β = 0. However, if β = 1 it coincides with the stochastic approximation algorithm of Goodwin-Ramadge-Caines [19], which provides convergence under stochastic disturbances and noise.
      </p>
      <p>
        Consequently, the resulting synaptic weights learning procedure can be written as
        w_j(k) = w_j(k-1) + (y_j(k) - α_j^R(k)w_j^T(k-1)x(k))α_j^R(k)x(k)/r_R(k), r_R(k) = βr_R(k-1) + (α_j^R(k))^2‖x(k)‖^2, if w_j^T(k-1)x(k) &gt; 0;
        w_j(k) = w_j(k-1) + (y_j(k) - α_j^L(k)w_j^T(k-1)x(k))α_j^L(k)x(k)/r_L(k), r_L(k) = βr_L(k-1) + (α_j^L(k))^2‖x(k)‖^2, otherwise.
        (<xref ref-type="bibr" rid="ref13">13</xref>)
      </p>
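      <p>The resulting procedure, slope tuning per (8) followed by the filtered weight update per (13), can be sketched for a single neuron as follows. This is an illustrative sketch, not the authors' implementation: the class and variable names, initial values, and the damping factor lam are assumptions; lam = 1 reproduces the one-step slope update (8) exactly (driving the current-sample error to zero), so a smaller value is used here to keep both the slopes and the weights adapting.</p>
      <preformat>
```python
import numpy as np

class AdPReLUNeuron:
    """Sketch of a neuron with AdPReLU activation trained by (8) and (13).

    Names, initial values and the damping factor lam are illustrative
    assumptions; lam = 1 recovers the one-step slope update (8) exactly.
    """

    def __init__(self, n, beta=0.9, lam=0.5):
        self.w = 0.01 * np.ones(n + 1)      # synaptic weights incl. bias w0
        self.alpha_r, self.alpha_l = 1.0, 0.1
        self.r_r, self.r_l = 1.0, 1.0       # accumulators r_R(k), r_L(k)
        self.beta = beta                    # forgetting factor in [0, 1]
        self.lam = lam                      # damping of the slope update

    def train_step(self, x, y_ref, eps=1e-8):
        x = np.concatenate(([1.0], x))      # prepend the constant bias input
        u = float(self.w @ x)               # internal activation u_j(k)
        if u > 0:
            y_hat = self.alpha_r * u
            # damped slope tuning based on (8), positive branch
            self.alpha_r += self.lam * (y_ref - y_hat) / (u + eps)
            a = self.alpha_r
            # filtered weight update (13), positive branch
            self.r_r = self.beta * self.r_r + a * a * float(x @ x)
            self.w = self.w + (y_ref - a * u) * a * x / self.r_r
        else:
            y_hat = self.alpha_l * u
            # damped slope tuning based on (8), negative branch
            self.alpha_l += self.lam * (y_ref - y_hat) * u / (u * u + eps)
            a = self.alpha_l
            # filtered weight update (13), negative branch
            self.r_l = self.beta * self.r_l + a * a * float(x @ x)
            self.w = self.w + (y_ref - a * u) * a * x / self.r_l
        return y_hat                        # prediction before the update

# illustrative usage on a single training sample
neuron = AdPReLUNeuron(n=2)
y_hat = neuron.train_step(np.array([0.5, -0.3]), y_ref=0.7)
```
      </preformat>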
      <p>
        Algorithms (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) and (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) describe the learning process of the neuron with adaptive parametric
rectified linear activation function in general.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Computer Experiments</title>
      <p>To demonstrate the efficiency of the proposed neuron and its learning procedure,
a simulation test was implemented based on the approximation of the reference signal
defined by the expression</p>
      <p>y_j(k) = tanh(0.1x_1(k) + 0.2x_2(k) + 0.3x_3(k) + 0.4x_4(k)) = tanh(u_j(k)),</p>
      <p>where x_i(k) is a uniformly distributed random variable on the interval
-1 ≤ x_i(k) ≤ 1. The results of the proposed approach were compared with the results
obtained using an Adaline neuron, a neuron with the standard ReLU activation function, and a
neuron with the classical tanh(u_j(k)) activation function.</p>
      <p>Fig. 2 shows how the mean square error changes:</p>
      <p>ē_j^2(N) = (1/N)Σ_{k=1}^N e_j^2(k) = ē_j^2(N-1) + (1/N)(e_j^2(N) - ē_j^2(N-1)).</p>
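      <p>The recursive form of the running mean square error above can be computed as in this short sketch (names are illustrative):</p>
      <preformat>
```python
def running_mse(errors):
    # Recursive running mean of squared errors:
    # m(N) = m(N-1) + (e(N)**2 - m(N-1)) / N
    m = 0.0
    for n, e in enumerate(errors, start=1):
        m += (e * e - m) / n
    return m

print(running_mse([1.0, 3.0]))  # equals (1 + 9) / 2 = 5.0
```
      </preformat>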
      <p>In this experiment the best results were obtained by the neuron with the AdPReLU
activation function: it surpasses the Adaline neuron, the neuron with ReLU, and the one
with tanh(u_j(k)). When such expressions as
y_j(k) = sin(0.5π u_j(k)),
y_j(k) = tanh(u_j(k)) if u_j(k) &gt; 0 and u_j^3(k) otherwise,
y_j(k) = tanh(u_j(k))
were chosen as the reference signal, the proposed neuron also outperforms Adaline, ReLU
and tanh(u_j(k)).</p>
      <p>In this paper, a formal neuron with an adaptive activation function, whose parameters
are tuned simultaneously with the synaptic weights, is introduced. The proposed
activation function is a generalization of the rectified unit family and improves the
approximating properties. The use of AdPReLU in deep neural networks prevents the
learning process from “vanishing and exploding gradients”. Moreover, the proposed
tuning algorithms are optimized for operating speed, i.e. they significantly reduce the
learning time of the network in general. Computational experiments confirm the
effectiveness of the proposed approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cybenko</surname>
          </string-name>
          , G.:
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Math.Contr. Sign. Syst</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          , (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cichocki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unbehauen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Neural Networks for Optimization and Signal Processing</source>
          , Stuttgart: Teubner, (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hornik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Approximation capabilities of multilayer feedforward networks</article-title>
          ,
          <source>Neural Networks</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>257</lpage>
          , (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
          </string-name>
          .V,
          <string-name>
            <surname>Kulishova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <surname>Rudenko</surname>
            ,
            <given-names>O.G.</given-names>
          </string-name>
          :
          <article-title>One model of formal neuron</article-title>
          ,
          <source>Reports of National Academy of Sciences of Ukraine</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>73</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>LeCun</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Deep Learning</article-title>
          ,
          <source>Nature</source>
          , vol.
          <volume>521</volume>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep learning in neural networks: An overview</article-title>
          ,
          <source>Neural Networks</source>
          , vol.
          <volume>61</volume>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>117</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          , MIT Press, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Graupe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Deep Learning Neural Networks: Design and Case Studies</source>
          , New Jersey: World Scientific, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Empirical evaluation of rectified activations in convolutional network</article-title>
          ,
          <source>arXiv preprint arXiv:1505.00853</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>,
          <string-name><surname>Ren</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification</article-title>
          ,
          <source>Proc. IEEE Int. Conf. on Computer Vision</source>
          , arXiv preprint arXiv:1502.01852, pp.
          <fpage>1026</fpage>
          -
          <lpage>1034</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Clevert</surname>
            ,
            <given-names>D-A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Fast and accurate deep network learning by exponential linear units (ELUs)</article-title>
          ,
          <source>arXiv preprint arXiv: 1511.07289</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kruschke</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Movellan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Benefits of gain: speeded learning and minimal layers backpropagation networks</article-title>
          ,
          <source>IEEE Trans. on Syst., Man, and Cybern</source>
          , vol.
          <volume>21</volume>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          , (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kaczmarz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Approximate solution of systems of linear equations</article-title>
          ,
          <source>Int. J. Control</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>1269</fpage>
          -
          <lpage>1271</lpage>
          , (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Widrow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoff</surname>
          </string-name>
          , Jr. M. E.:
          <article-title>Adaptive switching circuits</article-title>
          ,
          <source>IRE WESCON Convention Record, Part</source>
          <volume>4</volume>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
          , (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
          </string-name>
          .V.,
          <string-name>
            <surname>Pliss</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyova</surname>
            ,
            <given-names>T. V.</given-names>
          </string-name>
          :
          <article-title>Multistep optimal predictors of multivariable non-stationary stochastic processes</article-title>
          ,
          <source>Reports of Academy of Sciences of USSR</source>
          , vol.
          <volume>12</volume>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>49</lpage>
          , (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          , Ye.,
          <string-name>
            <surname>Kolodyazhniy</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stephan</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An adaptive learning algorithm for a neuro-fuzzy network</article-title>
          , Ed. by B.
          <source>Reusch “Computational Intelligence. Theory and Applications”</source>
          , Berlin Heidelberg: Springer-Verlag, pp.
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Otto</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          , Ye.,
          <string-name>
            <surname>Kolodyazhniy</surname>
          </string-name>
          , V.:
          <article-title>A new learning algorithm for a forecasting neuro-fuzzy network</article-title>
          ,
          <source>Integrated Computer Aided Engineering</source>
          , vol.
          <volume>10</volume>
          , №4, pp.
          <fpage>399</fpage>
          -
          <lpage>409</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Goodwin</surname>
            <given-names>G. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramadge</surname>
            <given-names>P. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caines</surname>
            <given-names>P. E.</given-names>
          </string-name>
          :
          <article-title>A globally convergent adaptive predictor</article-title>
          ,
          <source>Automatica</source>
          , vol.
          <volume>17</volume>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>140</lpage>
          , (
          <year>1981</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>