<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Formal Neuron Based on Adaptive Parametric Rectified Linear Activation Function and its Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>niy Bo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>nskiy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence department, Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Control systems research laboratory</institution>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>The paper proposes an adaptive activation function (AdPReLU) for deep neural networks, which generalizes the rectified unit family and allows its parameters to be tuned online during the learning process of the neural network. A learning algorithm for a formal neuron with this adaptive activation function is developed; it generalizes the delta rule and, based on error backpropagation, tunes the parameters of the activation function simultaneously with the synaptic weights. The proposed tuning algorithm is optimized to increase operating speed. Computational experiments confirm the effectiveness of the approach under consideration.</p>
      </abstract>
      <kwd-group>
        <kwd>deep neural network</kwd>
        <kwd>adaptive activation function</kwd>
        <kwd>delta-rule</kwd>
        <kwd>synaptic weights</kwd>
        <kwd>rectified linear unit</kwd>
        <kwd>learning algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>At the present time, artificial neural networks are widely used for solving Data Science
tasks due to their ability to tune parameters and architecture during information
processing and their universal approximation abilities. These properties enable the
effective solution of the tasks of pattern recognition (classification), time series
processing (prediction), and the emulation of complex nonlinear objects and processes
(identification and adaptive control).</p>
      <p>The most widely used are multilayer perceptrons whose node neurons usually are
Rosenblatt's elementary perceptrons with sigmoidal activation functions. Besides the
traditional σ-functions [1], the most widespread are tanh, SoftSign, Satlin [2, 3],
polynomial activation functions of special type [4], and other squashing functions.</p>
      <p>Deep neural networks (DNN) were created on the basis of the classical multilayer
perceptrons [5-8]. This has increased the effectiveness of image processing, audio signal
processing, arbitrary time series processing, and intelligent text analysis. However, there are
significant computational problems connected with the so-called vanishing and
exploding gradients, which arise from the specific form of sigmoidal activation functions.</p>
      <p>Consequently, the so-called rectified unit family [9] is used in DNNs as activation
functions. Besides the rectified linear unit (ReLU) itself, such functions can be noted as
the leaky rectified linear unit (LReLU), parametric rectified linear unit (PReLU),
randomized leaky rectified linear unit (RReLU), noisy rectified linear unit (NReLU),
and exponential linear unit (ELU) [7-12].</p>
      <p>The functions listed above are piecewise linear functions with fixed parameters
chosen by empirical considerations. Their advantage is that their derivatives do not
vanish, so they overcome the vanishing gradient problem and speed up the learning
process. However, these functions do not satisfy the conditions of G. Cybenko's
theorem [1], so to provide the required quality of approximation it is necessary to
increase the number of hidden layers in the DNN. This increases the DNN's
computational complexity and decreases the learning speed.</p>
      <p>Accordingly, it is expedient to introduce an adaptive parametric rectified linear
activation function (AdPReLU) within the rectified unit family, whose parameters can
be tuned during the learning process like the usual neuron's synaptic weights,
optimizing the adopted learning criterion and improving the approximating properties
of both the individual neuron and the neural network in general.</p>
    </sec>
    <sec id="sec-1a">
      <title>Architecture of Neuron with Adaptive Parametric Rectified Linear Activation Function</title>
      <p>Rosenblatt's perceptron as a node of any neural network implements the nonlinear mapping</p>
      <p>ŷ_j(k) = ψ_j(θ_{j0} + Σ_{i=1}^n w_{ji} x_i(k)) = ψ_j(Σ_{i=0}^n w_{ji} x_i(k)) = ψ_j(w_j^T x(k)) = ψ_j(u_j(k)),</p>
      <p>where ŷ_j(k) is the output signal of the j-th neuron of the network at the moment of discrete
time k = 1, 2, …; x(k) = (1, x_1(k), …, x_i(k), …, x_n(k))^T ∈ R^{n+1} is the input vector signal;
θ_{j0} ≡ w_{j0} is the bias signal; w_j = (w_{j0}, w_{j1}, …, w_{ji}, …, w_{jn})^T ∈ R^{n+1} is the vector of
synaptic weights adjusted in the learning process; u_j(k) is the internal activation signal; and ψ_j(·) is the
activation function of the j-th neuron, usually chosen by empirical considerations during
the design and operation of the neural network.</p>
      <p>Thus, in Cybenko's theorem the σ-function is used:</p>
      <p>
        ŷ_j(k) = ψ_j(u_j(k)) = 1/(1 + exp(-γ_j u_j(k))),
        (<xref ref-type="bibr" rid="ref1">1</xref>)
        where γ_j is a gain parameter which determines the shape of this function.
It should be noticed that the derivative of the sigmoidal function has the form
ψ'_j(u_j(k)) = γ_j ŷ_j(k)(1 - ŷ_j(k)),
i.e. it is a bell-shaped function. Therefore, the closer the value of
ŷ_j(k) is to 0 or 1, the closer the value of the derivative is to 0, which gives rise to the
vanishing gradient.
      </p>
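      <p>The vanishing of the sigmoidal derivative near saturation can be illustrated numerically; the following is a small sketch (function and variable names are illustrative, and the gain γ_j = 1 is an assumed value):</p>
      <preformat>
```python
import numpy as np

def sigmoid(u, gamma=1.0):
    # Logistic activation (1) with gain parameter gamma
    return 1.0 / (1.0 + np.exp(-gamma * u))

def sigmoid_derivative(u, gamma=1.0):
    # Derivative expressed through the output: gamma * y * (1 - y)
    y = sigmoid(u, gamma)
    return gamma * y * (1.0 - y)

u = np.array([0.0, 2.0, 5.0, 10.0])
# The derivative peaks at u = 0 and shrinks toward 0 as |u| grows,
# which is exactly the vanishing-gradient behaviour described above.
print(sigmoid_derivative(u))
```
      </preformat>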
      <p>In general form, the rectified unit family can be written as</p>
      <p>ψ_j(u_j(k)) = u_j(k) if u_j(k) &gt; 0, and ψ_j(u_j(k)) = α_j u_j(k) otherwise, (2)</p>
      <p>where the parameter α_j is chosen by empirical considerations and stays constant
during the learning process. In the standard ReLU α_j equals 0, so
ψ_j(u_j(k)) = 0 if u_j(k) &lt; 0.</p>
      <p>This may lead to the learning process being frozen for negative values of the
internal activation signal.</p>
      <p>
        The generalization of activation function (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) has the form
      </p>
      <p>
        ψ_j(u_j(k)) = α_j^R u_j(k) if u_j(k) &gt; 0, and ψ_j(u_j(k)) = α_j^L u_j(k) otherwise,
        (<xref ref-type="bibr" rid="ref3">3</xref>)
        however, there is the problem of choosing the values of the α_j^R and α_j^L parameters.
The solution is to add an extra procedure for tuning these parameters to the
neuron's learning process. This makes the learning process more sophisticated and leads
to the necessity of tuning n+3 parameters instead of the n+1 adjustable parameters
contained in the vector w_j. Nevertheless, an improvement of the approximating properties is
provided, since (<xref ref-type="bibr" rid="ref3">3</xref>) can take different forms, for example
ψ_j(u_j(k)) = |u_j(k)| when α_j^R = 1 and α_j^L = -1.
      </p>
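      <p>As a minimal sketch (function and parameter names are illustrative), the AdPReLU mapping above can be written as:</p>
      <preformat>
```python
import numpy as np

def adprelu(u, alpha_r, alpha_l):
    # AdPReLU (3): slope alpha_r on the positive side, alpha_l otherwise;
    # both slopes are adjustable, unlike the fixed slopes of (P)ReLU.
    return np.where(u > 0, alpha_r * u, alpha_l * u)

u = np.array([-2.0, -0.5, 0.5, 2.0])
print(adprelu(u, 1.0, -1.0))   # alpha_r = 1, alpha_l = -1 reproduces |u|
print(adprelu(u, 1.0, 0.0))    # alpha_l = 0 reproduces the standard ReLU
```
      </preformat>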
      <p>Fig. 1 shows the scheme of the neuron with adaptive parametric rectified linear activation
function, in which the parameters w_j, α_j^R, α_j^L are tuned during the learning
process.</p>
      <p>In Fig. 1, y_j(k) is the external reference signal and
e_j(k) = y_j(k) - ŷ_j(k) = y_j(k) - ψ_j(u_j(k)) is the learning error.</p>
    </sec>
    <sec id="sec-2">
      <title>Learning Procedure</title>
      <p>As the learning criterion, the standard quadratic function is used in the form</p>
      <p>E_j(k) = (1/2)e_j^2(k) = (1/2)(y_j(k) - ψ_j(u_j(k)))^2 = (1/2)(y_j(k) - ψ_j(Σ_{i=0}^n w_{ji} x_i(k)))^2.</p>
      <p>Its minimization by a gradient procedure leads to the algorithm of synaptic weights
tuning that can be written in the form</p>
      <p>w_{ji}(k) = w_{ji}(k-1) - η(k)∂E_j(k)/∂w_{ji} = w_{ji}(k-1) - η(k)e_j(k)∂e_j(k)/∂w_{ji} = w_{ji}(k-1) - η(k)e_j(k)(∂e_j(k)/∂u_j(k))(∂u_j(k)/∂w_{ji}) = w_{ji}(k-1) + η(k)e_j(k)ψ'_j(u_j(k))x_i(k) = w_{ji}(k-1) + η(k)δ_j(k)x_i(k),</p>
      <p>or in the vector form:</p>
      <p>w_j(k) = w_j(k-1) + η(k)δ_j(k)x(k), (4)</p>
      <p>where η(k) is a learning rate parameter and δ_j(k) = e_j(k)ψ'_j(u_j(k)) is the δ-error.</p>
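      <p>One step of the delta rule above can be sketched as follows (the names and the constant learning rate are illustrative assumptions):</p>
      <preformat>
```python
import numpy as np

def delta_rule_step(w, x, y_ref, psi, psi_prime, eta=0.1):
    # One step of (4): w(k) = w(k-1) + eta * delta * x,
    # with delta = e * psi'(u) and e = y_ref - psi(u).
    u = w @ x
    e = y_ref - psi(u)
    delta = e * psi_prime(u)
    return w + eta * delta * x

# Example with the tanh activation; x[0] = 1 plays the role of the bias input
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
w = delta_rule_step(w, x, 0.3, np.tanh, lambda u: 1 - np.tanh(u)**2)
```
      </preformat>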
      <p>For the standard hyperbolic tangent activation function it can be written as</p>
      <p>∂ψ_j(u_j)/∂u_j = γ_j(1 - tanh^2(γ_j u_j)) = γ_j sech^2(γ_j u_j) = γ_j(1 - ŷ_j^2),</p>
      <p>w_j(k) = w_j(k-1) + η(k)e_j(k)γ_j(1 - ŷ_j^2(k))x(k).</p>
      <p>
        Obviously, if ŷ_j(k) → ±1, the «vanishing gradient» effect appears. To improve the
convergence of algorithm (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), in [13] it was proposed to tune the gain parameter γ_j during learning;
however, that procedure also suffers from the «vanishing gradient» effect.
      </p>
      <p>The learning of the neuron shown in Fig. 1 with the backpropagation procedure
begins with the tuning of the α_j^R and α_j^L parameters. To simplify the transformations,
let us temporarily omit the R and L indices.</p>
      <p>
        Then
        α_j(k) = α_j(k-1) - η_α(k)∂E_j(k)/∂α_j = α_j(k-1) + η_α(k)(y_j(k) - α_j(k-1)w_j^T(k-1)x(k))w_j^T(k-1)x(k).
        (<xref ref-type="bibr" rid="ref6">6</xref>)
      </p>
      <p>
        The parameter learning process of the AdPReLU activation function (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) can be optimized to increase the operating speed using the following transformations:
      </p>
      <p>α_j(k) = α_j(k-1) + η_α(k)(y_j(k) - α_j(k-1)u_j(k))u_j(k),</p>
      <p>α_j(k)u_j(k) = α_j(k-1)u_j(k) + η_α(k)(y_j(k) - α_j(k-1)u_j(k))u_j^2(k),</p>
      <p>y_j(k) - α_j(k)u_j(k) = y_j(k) - α_j(k-1)u_j(k) - η_α(k)e_j(k)u_j^2(k),</p>
      <p>ẽ_j(k) = e_j(k) - η_α(k)e_j(k)u_j^2(k),</p>
      <p>ẽ_j^2(k) = e_j^2(k) - 2η_α(k)e_j^2(k)u_j^2(k) + η_α^2(k)e_j^2(k)u_j^4(k),</p>
      <p>∂ẽ_j^2(k)/∂η_α = -2e_j^2(k)u_j^2(k) + 2η_α(k)e_j^2(k)u_j^4(k) = 0,</p>
      <p>
        which suggests that the optimal value of the learning rate parameter η_α(k) is determined by the expression
        η_α(k) = u_j^{-2}(k).
        (<xref ref-type="bibr" rid="ref7">7</xref>)
      </p>
      <p>
        Then, by substituting (<xref ref-type="bibr" rid="ref7">7</xref>) into (<xref ref-type="bibr" rid="ref6">6</xref>) and returning to the R and L indices, the following result is obtained:
        α_j^R(k) = α_j^R(k-1) + (y_j(k) - α_j^R(k-1)u_j(k))u_j^{-1}(k) if u_j(k) &gt; 0,
        α_j^L(k) = α_j^L(k-1) + (y_j(k) - α_j^L(k-1)u_j(k))u_j^{-1}(k) otherwise.
        (<xref ref-type="bibr" rid="ref8">8</xref>)
      </p>
      <p>
        After the α_j^R and α_j^L parameters are tuned, we can return to the learning of the synaptic weights w_j. In this case the learning criterion is based on the error ẽ_j(k), i.e.
        Ẽ_j(k) = (1/2)ẽ_j^2(k) = (1/2)(y_j(k) - α_j(k)w_j^T x(k))^2.
        (<xref ref-type="bibr" rid="ref9">9</xref>)
      </p>
      <p>
        The gradient minimization of (<xref ref-type="bibr" rid="ref9">9</xref>) with respect to w_j leads to the procedure
        w_j(k) = w_j(k-1) - η(k)∇_{w_j}Ẽ_j(k) = w_j(k-1) + η(k)ẽ_j(k)α_j(k)x(k) = w_j(k-1) + η(k)(y_j(k) - α_j(k)w_j^T(k-1)x(k))α_j(k)x(k) = w_j(k-1) + η(k)(y_j(k) - w_j^T(k-1)x̃(k))x̃(k),
        (<xref ref-type="bibr" rid="ref10">10</xref>)
        where x̃(k) = α_j(k)x(k).
      </p>
      <p>
        It is simple to notice that algorithm (<xref ref-type="bibr" rid="ref10">10</xref>) is essentially the learning procedure of the Adaline neuron [2], which means that it can be optimized with respect to operating speed. As a result, we obtain the optimized one-step Kaczmarz-Widrow-Hoff learning algorithm [14, 15] in the form
        w_j(k) = w_j(k-1) + (y_j(k) - w_j^T(k-1)x̃(k))x̃(k)/‖x̃(k)‖^2 = w_j(k-1) + ẽ_j(k)(x̃^+(k))^T,
        (<xref ref-type="bibr" rid="ref11">11</xref>)
        where (·)^+ is the symbol of pseudoinversion.
      </p>
      <p>
        To prevent the “exploding gradient”, a regularized version of (<xref ref-type="bibr" rid="ref11">11</xref>) can be considered:
        w_j(k) = w_j(k-1) + (x̃(k)x̃^T(k) + αI)^{-1}ẽ_j(k)x̃(k),
        (<xref ref-type="bibr" rid="ref12">12</xref>)
        where α &gt; 0 is a regularization parameter. Using the matrix inversion lemma, we finally obtain the expression
        w_j(k) = w_j(k-1) + ẽ_j(k)x̃(k)/(α + ‖x̃(k)‖^2),
        which coincides with the additive form of Kaczmarz's algorithm.
      </p>
      <p>
        To provide additional filtering properties to learning algorithm (<xref ref-type="bibr" rid="ref12">12</xref>), the procedure [16-18]
        w_j(k) = w_j(k-1) + ẽ_j(k)x̃(k)/r(k), r(k) = βr(k-1) + ‖x̃(k)‖^2
        (where 0 ≤ β ≤ 1 is a forgetting factor) can be used, which coincides with algorithm (<xref ref-type="bibr" rid="ref11">11</xref>) if β = 0. However, if β = 1 it coincides with the stochastic approximation algorithm of Goodwin-Ramadge-Caines [19], which provides convergence under stochastic disturbances and noise.
      </p>
      <p>
        Consequently, the resulting synaptic weights learning procedure can be written as
        w_j(k) = w_j(k-1) + (y_j(k) - α_j^R(k)w_j^T(k-1)x(k))α_j^R(k)x(k)/r_R(k), r_R(k) = βr_R(k-1) + (α_j^R(k))^2‖x(k)‖^2, if w_j^T(k-1)x(k) &gt; 0;
        w_j(k) = w_j(k-1) + (y_j(k) - α_j^L(k)w_j^T(k-1)x(k))α_j^L(k)x(k)/r_L(k), r_L(k) = βr_L(k-1) + (α_j^L(k))^2‖x(k)‖^2, otherwise.
        (<xref ref-type="bibr" rid="ref13">13</xref>)
      </p>
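      <p>The resulting procedure, slope tuning per (8) followed by the filtered weight update per (13), can be sketched for a single neuron as follows. This is an illustrative sketch, not the authors' implementation: the class and variable names, initial values, and the damping factor lam are assumptions; lam = 1 reproduces the one-step slope update (8) exactly (driving the current-sample error to zero), so a smaller value is used here to keep both the slopes and the weights adapting.</p>
      <preformat>
```python
import numpy as np

class AdPReLUNeuron:
    """Sketch of a neuron with AdPReLU activation trained by (8) and (13).

    Names, initial values and the damping factor lam are illustrative
    assumptions; lam = 1 recovers the one-step slope update (8) exactly.
    """

    def __init__(self, n, beta=0.9, lam=0.5):
        self.w = 0.01 * np.ones(n + 1)      # synaptic weights incl. bias w0
        self.alpha_r, self.alpha_l = 1.0, 0.1
        self.r_r, self.r_l = 1.0, 1.0       # accumulators r_R(k), r_L(k)
        self.beta = beta                    # forgetting factor in [0, 1]
        self.lam = lam                      # damping of the slope update

    def train_step(self, x, y_ref, eps=1e-8):
        x = np.concatenate(([1.0], x))      # prepend the constant bias input
        u = float(self.w @ x)               # internal activation u_j(k)
        if u > 0:
            y_hat = self.alpha_r * u
            # damped slope tuning based on (8), positive branch
            self.alpha_r += self.lam * (y_ref - y_hat) / (u + eps)
            a = self.alpha_r
            # filtered weight update (13), positive branch
            self.r_r = self.beta * self.r_r + a * a * float(x @ x)
            self.w = self.w + (y_ref - a * u) * a * x / self.r_r
        else:
            y_hat = self.alpha_l * u
            # damped slope tuning based on (8), negative branch
            self.alpha_l += self.lam * (y_ref - y_hat) * u / (u * u + eps)
            a = self.alpha_l
            # filtered weight update (13), negative branch
            self.r_l = self.beta * self.r_l + a * a * float(x @ x)
            self.w = self.w + (y_ref - a * u) * a * x / self.r_l
        return y_hat                        # prediction before the update

# illustrative usage on a single training sample
neuron = AdPReLUNeuron(n=2)
y_hat = neuron.train_step(np.array([0.5, -0.3]), y_ref=0.7)
```
      </preformat>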
      <p>
        Algorithms (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) and (
        <xref ref-type="bibr" rid="ref13">13</xref>
        ) describe the learning process of the neuron with adaptive parametric
rectified linear activation function in general.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Computer Experiments</title>
      <p>To demonstrate the efficiency of the proposed neuron and its learning procedure,
a simulation test was implemented based on the approximation of the reference signal
defined by the expression</p>
      <p>y_j(k) = tanh(0.1x_1(k) + 0.2x_2(k) + 0.3x_3(k) + 0.4x_4(k)) = tanh(u_j(k)),</p>
      <p>where x_i(k) is a uniformly distributed random variable on the interval
-1 ≤ x_i(k) ≤ 1. The results of the proposed approach were compared with the results
obtained using an Adaline neuron, a neuron with the standard ReLU activation function, and a
neuron with the classical tanh(u_j(k)) activation function.</p>
      <p>Fig. 2 shows how the mean square error changes:</p>
      <p>ē_j^2(N) = (1/N)Σ_{k=1}^N e_j^2(k) = ē_j^2(N-1) + (1/N)(e_j^2(N) - ē_j^2(N-1)).</p>
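      <p>The recursive form of the running mean square error above can be computed as in this short sketch (names are illustrative):</p>
      <preformat>
```python
def running_mse(errors):
    # Recursive running mean of squared errors:
    # m(N) = m(N-1) + (e(N)**2 - m(N-1)) / N
    m = 0.0
    for n, e in enumerate(errors, start=1):
        m += (e * e - m) / n
    return m

print(running_mse([1.0, 3.0]))  # equals (1 + 9) / 2 = 5.0
```
      </preformat>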
      <p>In this experiment the best results were obtained by the neuron with the AdPReLU
activation function: it surpasses the Adaline neuron, the neuron with ReLU, and the one
with tanh(u_j(k)). When such expressions as
y_j(k) = sin(0.5π u_j(k)),
y_j(k) = tanh(u_j(k)) if u_j(k) &gt; 0 and u_j^3(k) otherwise,
y_j(k) = tanh(u_j(k))
were chosen as the reference signal, the proposed neuron also outperforms Adaline, ReLU
and tanh(u_j(k)).</p>
      <p>In this paper, a formal neuron with an adaptive activation function, whose parameters
are tuned simultaneously with the synaptic weights, is introduced. The proposed
activation function is a generalization of the rectified unit family and improves the
approximating properties. The use of AdPReLU in deep neural networks prevents the
learning process from “vanishing and exploding gradients”. Moreover, the proposed
tuning algorithms are optimized for operating speed, i.e. they significantly reduce the
learning time of the network in general. Computational experiments confirm the
effectiveness of the proposed approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cybenko</surname>
          </string-name>
          , G.:
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Math.Contr. Sign. Syst</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          , (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cichocki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unbehauen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Neural Networks for Optimization and Signal Processing</source>
          , Stuttgart: Teubner, (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hornik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Approximation capabilities of multilayer feedforward networks</article-title>
          ,
          <source>Neural Networks</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>257</lpage>
          , (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
          </string-name>
          .V,
          <string-name>
            <surname>Kulishova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <surname>Rudenko</surname>
            ,
            <given-names>O.G.</given-names>
          </string-name>
          :
          <article-title>One model of formal neuron</article-title>
          ,
          <source>Reports of National Academy of Sciences of Ukraine</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>73</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>LeCun</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Deep Learning</article-title>
          ,
          <source>Nature</source>
          , vol.
          <volume>521</volume>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep learning in neural networks: An overview</article-title>
          ,
          <source>Neural Networks</source>
          , vol.
          <volume>61</volume>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>117</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          , MIT Press, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Graupe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Deep Learning Neural Networks: Design and Case Studies</source>
          , New Jersey: World Scientific, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Empirical evaluation of rectified activations in convolutional network</article-title>
          ,
          <source>arXiv preprint arXiv:1505.00853</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>,
          <string-name><surname>Ren</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification</article-title>
          ,
          <source>Proc. IEEE Int. Conf. on Computer Vision</source>
          , arXiv preprint arXiv:1502.01852, pp.
          <fpage>1026</fpage>
          -
          <lpage>1034</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Clevert</surname>
            ,
            <given-names>D-A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unterthiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Fast and accurate deep network learning by exponential linear units (ELUs)</article-title>
          ,
          <source>arXiv preprint arXiv: 1511.07289</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR)</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kruschke</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Movellan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Benefits of gain: speeded learning and minimal layers backpropagation networks</article-title>
          ,
          <source>IEEE Trans. on Syst., Man, and Cybern</source>
          , vol.
          <volume>21</volume>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          , (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kaczmarz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Approximate solution of systems of linear equations</article-title>
          ,
          <source>Int. J. Control</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>1269</fpage>
          -
          <lpage>1271</lpage>
          , (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Widrow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoff</surname>
          </string-name>
          , Jr. M. E.:
          <article-title>Adaptive switching circuits</article-title>
          ,
          <source>IRE WESCON Convention Record, Part</source>
          <volume>4</volume>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
          , (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
          </string-name>
          .V.,
          <string-name>
            <surname>Pliss</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyova</surname>
            ,
            <given-names>T. V.</given-names>
          </string-name>
          :
          <article-title>Multistep optimal predictors of multivariable non-stationary stochastic processes</article-title>
          ,
          <source>Reports of Academy of Sciences of USSR</source>
          , vol.
          <volume>12</volume>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>49</lpage>
          , (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          , Ye.,
          <string-name>
            <surname>Kolodyazhniy</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stephan</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An adaptive learning algorithm for a neuro-fuzzy network</article-title>
          , Ed. by B.
          <source>Reusch “Computational Intelligence. Theory and Applications”</source>
          , Berlin Heidelberg: Springer-Verlag, pp.
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Otto</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodyanskiy</surname>
          </string-name>
          , Ye.,
          <string-name>
            <surname>Kolodyazhniy</surname>
          </string-name>
          , V.:
          <article-title>A new learning algorithm for a forecasting neuro-fuzzy network</article-title>
          ,
          <source>Integrated Computer Aided Engineering</source>
          , vol.
          <volume>10</volume>
          , №4, pp.
          <fpage>399</fpage>
          -
          <lpage>409</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Goodwin</surname>
            <given-names>G. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramadge</surname>
            <given-names>P. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caines</surname>
            <given-names>P. E.</given-names>
          </string-name>
          :
          <article-title>A globally convergent adaptive predictor</article-title>
          ,
          <source>Automatica</source>
          , vol.
          <volume>17</volume>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>140</lpage>
          , (
          <year>1981</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>