=Paper=
{{Paper
|id=Vol-2533/invited2
|storemode=property
|title=Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning
|pdfUrl=https://ceur-ws.org/Vol-2533/invited2.pdf
|volume=Vol-2533
|authors=Yevgeniy Bodyanskiy,Anastasiia Deineko,Iryna Pliss,Valeriia Slepanska
|dblpUrl=https://dblp.org/rec/conf/dcsmart/BodyanskiyDPS19
}}
==Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning==
Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning

Yevgeniy Bodyanskiy1[0000-0001-5418-2143], Anastasiia Deineko2[0000-0002-3279-3135], Iryna Pliss1[0000-0001-7918-7362] and Valeriia Slepanska2[0000-0002-0465-8593]

1 Control Systems Research Laboratory, 2 Artificial Intelligence Department, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine
yevgeniy.bodyanskiy@nure.ua, anastasiya.deineko@gmail.com, iryna.pliss@nure.ua, valeriia.slepanskaia@gmail.com

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2019 DCSMart Workshop.

Abstract. The paper proposes an adaptive activation function (AdPReLU) for deep neural networks that generalizes the rectified unit family and allows its parameters to be tuned online during the learning process of the neural network. A learning algorithm for a formal neuron with this adaptive activation function is developed; it generalizes the delta rule, is based on error backpropagation, and tunes the parameters of the activation function simultaneously with the synaptic weights. The proposed tuning algorithm is optimized for increased operating speed. Computational experiments confirm the effectiveness of the approach under consideration.

Keywords: deep neural network, adaptive activation function, delta rule, synaptic weights, rectified linear unit, learning algorithm.

1 Introduction

At present, artificial neural networks are widely used for solving Data Science tasks due to their ability to tune parameters and architecture during information processing and their universal approximation abilities. These properties provide effective solutions to the tasks of pattern recognition (classification), time series processing (prediction), and emulation of complex nonlinear objects and processes (identification and adaptive control).

The most widely used architectures are multilayer perceptrons whose node neurons are usually Rosenblatt's elementary perceptrons with sigmoidal activation functions. Besides traditional $\sigma$-functions [1], the most widespread are tanh, SoftSign, Satlin [2, 3], polynomial activation functions of special type [4], and other squashing functions.

Deep neural networks (DNN) [5-8] were created on the basis of the classical multilayer perceptrons. This has increased the effectiveness of processing images, audio signals, and arbitrary time series, as well as of intelligent text analysis. However, there are significant computational problems connected with the so-called vanishing and exploding gradients, which are caused by the specific form of the sigmoidal activation functions.

Consequently, the so-called rectified unit family [9] is used in DNN as activation functions. Besides the rectified linear unit (ReLU) itself, this family includes the leaky rectified linear unit (LReLU), parametric rectified linear unit (PReLU), randomized leaky rectified linear unit (RReLU), noisy rectified linear unit (NReLU), and exponential linear unit (ELU) [7-12]. The functions listed above are piecewise linear functions with fixed parameters chosen from empirical considerations. Their advantage is that their derivatives do not vanish, so they overcome the vanishing gradient problem and allow the learning process to be accelerated.
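As a point of reference for the adaptive function introduced below, the following short Python sketch (not part of the paper; the function names are illustrative) shows the fixed-parameter members of this family, where the slope of the negative branch is set once by hand and never tuned:

```python
import numpy as np

def relu(u):
    """Standard ReLU: the output is zero for negative internal activation."""
    return np.where(u > 0, u, 0.0)

def leaky_relu(u, alpha=0.01):
    """LReLU/PReLU form: a fixed slope alpha for u <= 0, chosen from empirical considerations."""
    return np.where(u > 0, u, alpha * u)
```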
However, these functions do not satisfy the conditions of G. Cybenko's theorem [1], so to provide the required quality of approximation it is necessary to increase the number of hidden layers in the DNN. This increases the computational complexity of the DNN and slows down the learning process.

Accordingly, it is expedient to introduce an adaptive parametric rectified linear activation function (AdPReLU) within the rectified unit family, whose parameters can be tuned during the learning process in the same way as the usual synaptic weights of the neuron, optimizing the adopted learning criterion and improving the approximating properties of both the individual neuron and the neural network as a whole.

2 Architecture of Neuron with Adaptive Parametric Rectified Linear Activation Function

Rosenblatt's perceptron as a node of any neural network implements a nonlinear mapping of the form

$$\hat{y}_j(k) = \psi_j\Big(\theta_{j0} + \sum_{i=1}^{n} w_{ji} x_i(k)\Big) = \psi_j\Big(\sum_{i=0}^{n} w_{ji} x_i(k)\Big) = \psi_j\big(w_j^T x(k)\big) = \psi_j\big(u_j(k)\big)$$

where $\hat{y}_j(k)$ is the output signal of the $j$-th neuron of the network at the moment of discrete time $k = 1, 2, \ldots$; $x(k) = (1, x_1(k), \ldots, x_i(k), \ldots, x_n(k))^T \in R^{n+1}$ is the input vector signal; $\theta_{j0} \equiv w_{j0}$ is the bias signal; $w_j = (w_{j0}, w_{j1}, \ldots, w_{ji}, \ldots, w_{jn})^T \in R^{n+1}$ is the vector of synaptic weights adjusted in the learning process; $u_j(k)$ is the internal activation signal; $\psi_j(\cdot)$ is the activation function of the $j$-th neuron, usually chosen from empirical considerations before the learning and functioning of the neural network.

Thus, in Cybenko's theorem the $\sigma$-function is used:

$$\hat{y}_j(k) = \psi_j(u_j(k)) = \frac{1}{1 + \exp\big(-\gamma_j u_j(k)\big)} \qquad (1)$$

where $\gamma_j$ is a gain parameter that determines the form of this function. It should be noticed that the derivative of the sigmoidal function,

$$\psi_j'(u_j(k)) = \gamma_j \hat{y}_j(k)\big(1 - \hat{y}_j(k)\big),$$

is a bell-shaped function. Therefore, the closer the value of $\hat{y}_j(k)$ is to 0 or 1, the closer the value of the derivative is to 0, which gives rise to the vanishing gradient.

In general form, the rectified unit family can be written as

$$\psi_j(u_j(k)) = \begin{cases} u_j(k) & \text{if } u_j(k) > 0, \\ \alpha_j u_j(k) & \text{otherwise} \end{cases} \qquad (2)$$

where the parameter $\alpha_j$ is chosen from empirical considerations and stays constant during the learning process. In the standard ReLU $\alpha_j$ equals 0, so that $\psi_j(u_j(k)) = 0$ if $u_j(k) < 0$. This may lead to the learning process freezing for negative values of the internal activation signal.

The generalization of activation function (2) has the form

$$\psi_j(u_j(k)) = \begin{cases} \alpha_j^R u_j(k) & \text{if } u_j(k) > 0, \\ \alpha_j^L u_j(k) & \text{otherwise,} \end{cases} \qquad (3)$$

however, the problem of choosing the values of the parameters $\alpha_j^R$ and $\alpha_j^L$ arises. The solution is to introduce an additional procedure for tuning these parameters into the neuron's learning process. This makes the learning process more sophisticated and requires tuning $n+3$ parameters instead of the $n+1$ adjustable parameters contained in the vector $w_j$. In spite of that, an improvement of the approximating properties is provided, since (3) can take different forms, for example $\psi_j(u_j(k)) = u_j(k)$ when $\alpha_j^R = \alpha_j^L = 1$.

The scheme of the neuron with adaptive parametric rectified linear activation function, in which the parameters $w_j$, $\alpha_j^R$, $\alpha_j^L$ are tuned during the learning process, is shown in Fig. 1.

Fig. 1. Neuron with adaptive parametric rectified linear unit (AdPReLU).

In Fig. 1, $y_j(k)$ is the external reference signal and $e_j(k) = y_j(k) - \hat{y}_j(k) = y_j(k) - \psi_j(u_j(k))$ is the learning error.
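To make the architecture in Fig. 1 concrete, here is a minimal Python sketch of the forward pass of a single neuron with the AdPReLU activation (3). It is not the authors' implementation; the class name, the weight initialization, and the initial slope values are assumptions made for illustration only.

```python
import numpy as np

class AdPReLUNeuron:
    """Formal neuron with AdPReLU (3): psi(u) = a_R * u if u > 0, a_L * u otherwise."""

    def __init__(self, n_inputs, a_right=1.0, a_left=0.1, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # w[0] plays the role of the bias w_j0; the constant input 1 is prepended in forward().
        self.w = rng.normal(scale=0.1, size=n_inputs + 1)
        self.a_right = a_right   # slope alpha_j^R for positive internal activation
        self.a_left = a_left     # slope alpha_j^L for negative internal activation

    def forward(self, x):
        """Return the internal activation u_j(k) and the output y_hat_j(k) for input vector x."""
        x_ext = np.concatenate(([1.0], x))   # x(k) = (1, x_1(k), ..., x_n(k))^T
        u = self.w @ x_ext                   # u_j(k) = w_j^T x(k)
        a = self.a_right if u > 0 else self.a_left
        return u, a * u
```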
3 Learning Procedure

As the learning criterion, the standard quadratic function is used in the form

$$E_j(k) = \frac{1}{2} e_j^2(k) = \frac{1}{2}\big(y_j(k) - \psi_j(u_j(k))\big)^2 = \frac{1}{2}\Big(y_j(k) - \psi_j\Big(\sum_{i=0}^{n} w_{ji} x_i(k)\Big)\Big)^2.$$

Its minimization by a gradient procedure leads to the synaptic weight tuning algorithm that can be written in the form

$$\begin{aligned} w_{ji}(k) &= w_{ji}(k-1) - \eta(k)\frac{\partial E_j(k)}{\partial w_{ji}} = w_{ji}(k-1) - \eta(k) e_j(k)\frac{\partial e_j(k)}{\partial w_{ji}} = \\ &= w_{ji}(k-1) - \eta(k) e_j(k)\frac{\partial e_j(k)}{\partial u_j(k)}\frac{\partial u_j(k)}{\partial w_{ji}} = w_{ji}(k-1) + \eta(k) e_j(k)\psi_j'(u_j(k)) x_i(k) = \\ &= w_{ji}(k-1) + \eta(k)\delta_j(k) x_i(k) \end{aligned}$$

or in the vector form

$$w_j(k) = w_j(k-1) + \eta(k)\delta_j(k) x(k)$$

where $\eta(k)$ is a learning rate parameter and $\delta_j(k) = e_j(k)\psi_j'(u_j(k))$ is the $\delta$-error.

For the standard hyperbolic tangent activation function this can be written as

$$\frac{\partial \psi_j(u_j)}{\partial u_j} = \gamma_j\big(1 - \tanh^2(\gamma_j u_j)\big) = \gamma_j\,\mathrm{sech}^2(\gamma_j u_j) = \gamma_j\big(1 - \hat{y}_j^2\big),$$

$$w_j(k) = w_j(k-1) + \eta(k) e_j(k)\gamma_j\big(1 - \hat{y}_j^2(k)\big) x(k). \qquad (4)$$

Obviously, if $\hat{y}_j(k) \to \pm 1$, the "vanishing gradient" effect appears. To improve the convergence of algorithm (4), it was proposed in [13] to tune the gain parameter $\gamma_j$ by a similar gradient procedure (5); however, this also suffers from the "vanishing gradient" effect.

The learning of the neuron shown in Fig. 1 by the backpropagation procedure begins with the tuning of the parameters $\alpha_j^R$ and $\alpha_j^L$. To simplify the transformations, let us temporarily drop the R and L indexes. Then

$$\begin{aligned} \alpha_j(k) &= \alpha_j(k-1) - \eta_\alpha(k)\frac{\partial E_j(k)}{\partial \alpha_j} = \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j(k) = \\ &= \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) w_j^T(k-1) x(k)\big) w_j^T(k-1) x(k). \end{aligned} \qquad (6)$$

The parameter learning process (6) of the AdPReLU activation function can be optimized for increased operating speed. The following transformations can be carried out:

$$\alpha_j(k) = \alpha_j(k-1) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j(k),$$
$$\alpha_j(k) u_j(k) = \alpha_j(k-1) u_j(k) + \eta_\alpha(k)\big(y_j(k) - \alpha_j(k-1) u_j(k)\big) u_j^2(k),$$
$$y_j(k) - \alpha_j(k) u_j(k) = y_j(k) - \alpha_j(k-1) u_j(k) - \eta_\alpha(k) e_j(k) u_j^2(k),$$
$$\tilde{e}_j(k) = e_j(k) - \eta_\alpha(k) e_j(k) u_j^2(k),$$
$$\tilde{e}_j^2(k) = e_j^2(k) - 2\eta_\alpha(k) e_j^2(k) u_j^2(k) + \eta_\alpha^2(k) e_j^2(k) u_j^4(k),$$
$$\frac{\partial \tilde{e}_j^2(k)}{\partial \eta_\alpha} = -2 e_j^2(k) u_j^2(k) + 2\eta_\alpha(k) e_j^2(k) u_j^4(k) = 0,$$

which shows that the optimal value of the learning rate parameter $\eta_\alpha(k)$ is determined by the expression

$$\eta_\alpha(k) = u_j^{-2}(k). \qquad (7)$$

Substituting (7) into (6) and returning to the R and L indexes, the following result is obtained:

$$\begin{cases} \alpha_j^R(k) = \alpha_j^R(k-1) + \big(y_j(k) - \alpha_j^R(k-1) u_j(k)\big) u_j^{-1}(k) & \text{if } u_j(k) > 0, \\ \alpha_j^L(k) = \alpha_j^L(k-1) + \big(y_j(k) - \alpha_j^L(k-1) u_j(k)\big) u_j^{-1}(k) & \text{otherwise.} \end{cases} \qquad (8)$$

After the parameters $\alpha_j^R$ and $\alpha_j^L$ have been tuned, it is possible to return to the learning of the synaptic weights $w_j$. In this case the learning criterion is based on the error $\tilde{e}_j(k)$, i.e.

$$\tilde{E}_j(k) = \frac{1}{2}\tilde{e}_j^2(k) = \frac{1}{2}\big(y_j(k) - \alpha_j(k) w_j^T x(k)\big)^2. \qquad (9)$$
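A minimal Python sketch of rule (8), i.e. the slope update with the optimal step (7), is given below. It assumes a single neuron; the function name and the small threshold used to avoid division by a near-zero activation are illustrative additions, not part of the paper.

```python
def update_alpha(a_right, a_left, y_ref, u, eps=1e-12):
    """One step of rule (8): retune the AdPReLU slope on the branch selected by the sign of u.

    y_ref -- external reference signal y_j(k)
    u     -- internal activation u_j(k) = w_j^T(k-1) x(k)
    eps   -- safeguard against division by a vanishing activation (implementation detail)
    """
    if abs(u) < eps:
        return a_right, a_left               # skip the update when u_j(k) is (almost) zero
    if u > 0:
        a_right = a_right + (y_ref - a_right * u) / u
    else:
        a_left = a_left + (y_ref - a_left * u) / u
    return a_right, a_left
```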
The gradient minimization of (9) with respect to $w_j$ leads to the procedure

$$\begin{aligned} w_j(k) &= w_j(k-1) - \eta(k)\nabla_{w_j}\tilde{E}_j(k) = w_j(k-1) + \eta(k)\tilde{e}_j(k)\alpha_j(k) x(k) = \\ &= w_j(k-1) + \eta(k)\big(y_j(k) - \alpha_j(k) w_j^T(k-1) x(k)\big)\alpha_j(k) x(k) = \\ &= w_j(k-1) + \eta(k)\big(y_j(k) - w_j^T(k-1)\tilde{x}(k)\big)\tilde{x}(k) \end{aligned} \qquad (10)$$

where $\tilde{x}(k) = \alpha_j(k) x(k)$.

It is easy to notice that algorithm (10) is essentially the learning procedure of the Adaline neuron [2], which means that it can be optimized with respect to operating speed. As a result, we obtain the optimized one-step Kaczmarz-Widrow-Hoff learning algorithm [14, 15] in the form

$$w_j(k) = w_j(k-1) + \frac{y_j(k) - w_j^T(k-1)\tilde{x}(k)}{\|\tilde{x}(k)\|^2}\,\tilde{x}(k) = w_j(k-1) + \tilde{e}_j(k)\tilde{x}^{+T}(k) \qquad (11)$$

where $(\cdot)^+$ denotes pseudoinversion.

To prevent the "exploding gradient", the regularized version of (11) can be considered:

$$w_j(k) = w_j(k-1) + \big(\tilde{x}(k)\tilde{x}^T(k) + \alpha I\big)^{-1}\tilde{e}_j(k)\tilde{x}(k)$$

where $\alpha > 0$ is a momentum (regularization) term. Using the matrix inversion lemma, we finally obtain the expression

$$w_j(k) = w_j(k-1) + \frac{\tilde{e}_j(k)\tilde{x}(k)}{\alpha + \|\tilde{x}(k)\|^2}, \qquad (12)$$

which coincides with the additive form of Kaczmarz's algorithm.

To provide additional filtering properties to the learning algorithm (12), the procedure [16-18]

$$w_j(k) = w_j(k-1) + \frac{\tilde{e}_j(k)\tilde{x}(k)}{r(k)}, \qquad r(k) = \beta r(k-1) + \|\tilde{x}(k)\|^2$$

(where $0 \le \beta \le 1$ is a forgetting factor) can be used, which coincides with algorithm (11) if $\beta = 0$. If $\beta = 1$, it coincides with the stochastic approximation algorithm of Goodwin-Ramadge-Caines [19], which provides convergence under stochastic disturbances and noise.

Consequently, the resulting synaptic weight learning procedure can be written as

$$w_j(k) = \begin{cases} w_j(k-1) + \dfrac{\big(y_j(k) - \alpha_j^R(k) w_j^T(k-1) x(k)\big)\alpha_j^R(k) x(k)}{r^R(k)}, \\ \quad r^R(k) = \beta r^R(k-1) + \big(\alpha_j^R(k)\big)^2\|x(k)\|^2 & \text{if } w_j^T(k-1) x(k) > 0, \\[4pt] w_j(k-1) + \dfrac{\big(y_j(k) - \alpha_j^L(k) w_j^T(k-1) x(k)\big)\alpha_j^L(k) x(k)}{r^L(k)}, \\ \quad r^L(k) = \beta r^L(k-1) + \big(\alpha_j^L(k)\big)^2\|x(k)\|^2 & \text{otherwise.} \end{cases} \qquad (13)$$

Algorithms (8) and (13) describe the learning process of the neuron with adaptive parametric rectified linear activation function in general.
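Putting rules (8) and (13) together, the following Python sketch shows one complete learning step of a single AdPReLU neuron. It is not the authors' code: the class name, the initial slope values, and in particular the forgetting-factor smoothing applied to the slope step (the paper specifies the $\beta$-smoothed denominator only for the weight update (13); with $\beta = 0$ the slope step below reduces to the optimal rate (7) of rule (8)) are assumptions of this sketch.

```python
import numpy as np

class AdPReLULearner:
    """Single-neuron AdPReLU trainer sketched after rules (8) and (13).

    The beta-smoothed step for the slopes (r_aR, r_aL) is an assumption of this sketch;
    with beta = 0 it coincides with the optimal one-step rate (7) used in rule (8).
    """

    def __init__(self, n_inputs, beta=0.9, eps=1e-12, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.w = rng.normal(scale=0.1, size=n_inputs + 1)   # weights, w[0] is the bias w_j0
        self.a_right, self.a_left = 1.0, 0.1                # initial AdPReLU slopes (illustrative)
        self.rR = self.rL = eps                             # denominators r^R(k), r^L(k) of (13)
        self.r_aR = self.r_aL = eps                         # smoothed denominators for the slopes
        self.beta = beta                                    # forgetting factor, 0 <= beta <= 1

    def step(self, x, y_ref):
        """One learning step on the sample (x(k), y_j(k)); returns the output estimate."""
        x_ext = np.concatenate(([1.0], x))
        u = self.w @ x_ext                                  # u_j(k) = w_j^T(k-1) x(k)
        sq_norm = x_ext @ x_ext
        e = y_ref - (self.a_right if u > 0 else self.a_left) * u   # learning error e_j(k)
        if u > 0:
            # slope update in the spirit of (8), with the beta-smoothed step instead of u^{-2}
            self.r_aR = self.beta * self.r_aR + u * u
            self.a_right += e * u / self.r_aR
            a = self.a_right
            # weight update by rule (13) with the refined error e~_j(k)
            e_tilde = y_ref - a * u
            self.rR = self.beta * self.rR + a * a * sq_norm
            self.w += e_tilde * a * x_ext / self.rR
        else:
            self.r_aL = self.beta * self.r_aL + u * u
            self.a_left += e * u / self.r_aL
            a = self.a_left
            e_tilde = y_ref - a * u
            self.rL = self.beta * self.rL + a * a * sq_norm
            self.w += e_tilde * a * x_ext / self.rL
        return a * u
```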
4 Computer Experiments

To demonstrate the efficiency of the proposed neuron and its learning procedure, a simulation test was implemented based on the approximation of a reference signal defined by the expression

$$y_j(k) = \tanh\big(0.1 x_1(k) + 0.2 x_2(k) + 0.3 x_3(k) + 0.4 x_4(k)\big) = \tanh\big(u_j(k)\big)$$

where $x_i(k)$ is a uniformly distributed random variable on the interval $-1 \le x_i(k) \le 1$. The results of the proposed approach were compared with the results obtained using the Adaline neuron, a neuron with the standard ReLU activation function, and a neuron with the classical $\tanh(u_j(k))$ activation function. The figures below show how the mean square error

$$\bar{e}_j^2(N) = \frac{1}{N}\sum_{k=1}^{N} e_j^2(k) = \bar{e}_j^2(N-1) + \frac{1}{N}\big(e_j^2(N) - \bar{e}_j^2(N-1)\big)$$

changes during learning. In this experiment the best results were obtained by the neuron with the AdPReLU activation function: it surpasses the Adaline neuron, the neuron with ReLU, and the one with $\tanh(u_j(k))$. When the reference signals

$$y_j(k) = \sin\big(0.5\pi u_j(k)\big), \qquad y_j(k) = \begin{cases} \tanh u_j(k) & \text{if } u_j(k) > 0, \\ u_j^3(k) & \text{otherwise,} \end{cases} \qquad y_j(k) = \tanh\big(u_j(k)\big)$$

were chosen, the proposed neuron also outperforms Adaline, ReLU, and $\tanh(u_j(k))$.

Fig. 2. Convergence curves for training neurons with the reference signal $y_j(k) = \sin(0.5\pi u_j(k))$.

Fig. 3. Convergence curves for training neurons with the reference signal $y_j(k) = \tanh u_j(k)$ if $u_j(k) > 0$, $u_j^3(k)$ otherwise.

Fig. 4. Convergence curves for training neurons with the reference signal $y = \tanh(u_j(k))$.
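As an illustration of how the first experiment could be reproduced, the sketch below reuses the AdPReLULearner class sketched at the end of Section 3, generates the uniformly distributed inputs, and tracks the recursive mean square error defined above. The number of samples and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np

# Reproduction sketch of the first experiment: y_j(k) = tanh(0.1*x1 + 0.2*x2 + 0.3*x3 + 0.4*x4).
rng = np.random.default_rng(0)
learner = AdPReLULearner(n_inputs=4, beta=0.9, rng=rng)   # class sketched after Section 3

weights_true = np.array([0.1, 0.2, 0.3, 0.4])
mse, n = 0.0, 0
for k in range(1, 2001):
    x = rng.uniform(-1.0, 1.0, size=4)            # x_i(k) uniformly distributed on [-1, 1]
    y_ref = np.tanh(weights_true @ x)             # reference signal y_j(k)
    y_hat = learner.step(x, y_ref)
    n += 1
    mse += ((y_ref - y_hat) ** 2 - mse) / n       # recursive mean square error, as in Section 4
print(f"running MSE after {n} samples: {mse:.6f}")
```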
5 Conclusion

In this paper, a formal neuron with an adaptive activation function, whose parameters are tuned simultaneously with the synaptic weights, is introduced. The proposed activation function is a generalization of the rectified unit family and provides an improvement of the approximating properties. The use of AdPReLU in deep neural networks protects the learning process from the "vanishing and exploding gradients". In addition, the proposed tuning algorithms are optimized for increased operating speed, i.e. they significantly reduce the learning time of the network as a whole. Computational experiments confirm the effectiveness of the proposed approach.

References

1. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Contr. Sign. Syst., vol. 2, pp. 303-314 (1989).
2. Cichocki, A., Unbehauen, R.: Neural Networks for Optimization and Signal Processing. Teubner, Stuttgart (1993).
3. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks, vol. 4, pp. 251-257 (1991).
4. Bodyanskiy, Ye.V., Kulishova, N.Ye., Rudenko, O.G.: One model of formal neuron. Reports of National Academy of Sciences of Ukraine, vol. 4, pp. 69-73 (2001).
5. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature, vol. 521, pp. 436-444 (2015).
6. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks, vol. 61, pp. 82-117 (2015).
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016).
8. Graupe, D.: Deep Learning Neural Networks: Design and Case Studies. World Scientific, New Jersey (2016).
9. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015).
10. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proc. IEEE Int. Conf. on Computer Vision, pp. 1026-1034, arXiv preprint arXiv:1502.01852 (2015).
11. Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015).
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 770-778 (2016).
13. Kruschke, J.K., Movellan, J.R.: Benefits of gain: speeded learning and minimal layers backpropagation networks. IEEE Trans. on Syst., Man, and Cybern., vol. 21, pp. 273-280 (1991).
14. Kaczmarz, S.: Approximate solution of systems of linear equations. Int. J. Control, vol. 53, pp. 1269-1271 (1993).
15. Widrow, B., Hoff, M.E., Jr.: Adaptive switching circuits. IRE WESCON Convention Record, Part 4, pp. 96-104 (1960).
16. Bodyanskiy, Ye.V., Pliss, I.P., Solovyova, T.V.: Multistep optimal predictors of multivariable non-stationary stochastic processes. Reports of Academy of Sciences of USSR, vol. 12, pp. 47-49 (1986).
17. Bodyanskiy, Ye., Kolodyazhniy, V., Stephan, A.: An adaptive learning algorithm for a neuro-fuzzy network. In: Reusch, B. (ed.) Computational Intelligence. Theory and Applications, pp. 68-75. Springer-Verlag, Berlin Heidelberg (2001).
18. Otto, P., Bodyanskiy, Ye., Kolodyazhniy, V.: A new learning algorithm for a forecasting neuro-fuzzy network. Integrated Computer-Aided Engineering, vol. 10, no. 4, pp. 399-409 (2003).
19. Goodwin, G.C., Ramadge, P.J., Caines, P.E.: A globally convergent adaptive predictor. Automatica, vol. 17, pp. 135-140 (1981).