<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ReLU and sigmoidal activation functions</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Multilingual Speech Technologies, North-West University</institution>
          ,
          <addr-line>South Africa; and CAIR</addr-line>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The generalization capabilities of deep neural networks are not well understood, and in particular, the influence of activation functions on generalization has received little theoretical attention. Phenomena such as vanishing gradients, node saturation and network sparsity have been identified as possible factors when comparing different activation functions [1]. We investigate these factors using fully connected feedforward networks on two standard benchmark problems, and find that the most salient differences between networks with sigmoidal and ReLU activations relate to the way that class-distinctive information is propagated through a network.</p>
      </abstract>
      <kwd-group>
        <kwd>Non-linear activation function</kwd>
        <kwd>Generalization</kwd>
        <kwd>Activation distribution</kwd>
        <kwd>Sparsity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The simplest class of neural networks are the multilayer perceptrons (MLPs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
which consist of an input layer, one or more hidden layers and an output layer.
Each node in a previous layer is connected to all the nodes in the following layer
by a vector of weight values. These weight values are adjusted in the learning
process so that the network output approximates the labeled target value,
minimizing some loss function. To create a non-linear representation and allow the
network to learn complex non-linear problems, each node is followed by an
activation function that effectively “squishes” or rectifies the output of the node.
Two historically popular activation functions for deep neural networks are the
established sigmoidal function and the widely used rectified linear unit (ReLU) [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ]. Various other activation functions have been proposed, but none are clearly
superior to these functions; we therefore investigate only these two functions.
      </p>
      <p>
        Although several researchers have compared the performance of non-linear
activation functions in deep models [
        <xref ref-type="bibr" rid="ref1 ref12 ref2">1, 2, 12</xref>
        ] and the respective difficulties of
training DNNs with these activation functions have been established, a more
concrete understanding of their effect on the training and generalization process
is lacking. It is now widely understood that ReLU networks [
        <xref ref-type="bibr" rid="ref1 ref13 ref6">1, 6, 13</xref>
        ] are easy
to optimize because of their similarity to linear units, apart from ReLU units
outputting zero across half of their domain. This fact allows the gradient of a
rectified linear unit to remain not only large, but also constant whenever the
unit is active. A drawback of using ReLUs, however, is that they cannot learn
via gradient-based methods on samples for which the input to the activation function is 0 or less,
since the gradient there is zero [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
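      <p>As a minimal illustration of the gradient behavior described above (a sketch, not code from this study), the ReLU and its derivative can be written as follows. The function names are illustrative, and returning a derivative of zero at exactly 0 is an assumed convention, since the derivative is undefined at that point.</p>
      <preformat>
```python
def relu(z):
    """Rectified linear unit: passes positive inputs, outputs 0 otherwise."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU with respect to its input z.

    The derivative is undefined at z == 0; returning 0 there is an
    assumed convention for this sketch.
    """
    return 1.0 if z > 0 else 0.0


# The gradient is constant (1) whenever the unit is active and 0 otherwise.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.1f}  grad={relu_grad(z):.1f}")
```
      </preformat>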
      <p>
        Prior to the introduction of ReLUs, most DNNs used activation functions
called logistic sigmoid activations or hyperbolic tangent activations. Sigmoidal
units saturate across most of their domain: they saturate to a value of 1 when
the input is large and positive, and saturate to a value of 0 when the input is
large and negative [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The fact that a sigmoidal unit saturates over most of
its domain can make gradient-based learning difficult [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The gradient of an
unscaled sigmoidal function never exceeds 1 (and is at most 0.25 for the logistic sigmoid) and tapers off to 0 when
saturating. This causes a “vanishing gradients” problem when training deep
networks that use these activation functions, because these fractions are multiplied
across layers during backpropagation and the gradients end up near zero.
      </p>
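      <p>The multiplicative effect described above can be illustrated with a deliberately simplified sketch. It assumes the logistic sigmoid, ignores the weight matrices entirely, and uses the same pre-activation value at every layer, so it only shows how quickly a product of sigmoid derivatives shrinks with depth; it is not a model of any network trained in this study. The helper name chained_grad is hypothetical.</p>
      <preformat>
```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_grad(z, depth):
    """Product of `depth` identical sigmoid derivatives (weights ignored)."""
    return sigmoid_grad(z) ** depth


# Compare the best case (z = 0) with a partly saturated unit (z = 4).
for depth in (2, 4, 6, 8):
    print(f"{depth} layers: z=0 -> {chained_grad(0.0, depth):.2e}, "
          f"z=4 -> {chained_grad(4.0, depth):.2e}")
```
      </preformat>
      <p>Even in the best case, where every unit sits at the point of maximum slope, the product shrinks by a factor of four per layer.</p>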
      <p>
        Thus, each of the popular activation functions faces certain difficulties
during training, and remedies have been developed to cope with these challenges.
The general consensus is that ReLU activations are empirically preferable to
sigmoidal units [
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref6">1, 6, 12, 13</xref>
        ], but the evidence in this regard is not overwhelming and
theoretical motivations for their superiority are weak. These theoretical
motivations focus heavily on the beneficial effects of sparse representations in hidden
layers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], without much evidence of its effect on network training and
generalization performance or the behavior of hidden nodes. Another strong theme
in previous comparisons of activation functions is that of “vanishing gradients”,
and although this phenomenon is visible and pertinent in many cases, any
discrepancies in generalization performance cannot fully be attributed to vanishing
gradients when sigmoidal networks are able to train to 100% classification
accuracy. Overall, we suspect that there are more factors that need to be considered
when comparing activation functions, specifically regarding the behavior of
hidden nodes. In this study we therefore revisit the training and generalization
performance of DNNs trained with ReLU and sigmoid activation functions. We
then investigate the effect of the activation function on the behavior of nodes
in hidden layers and how, for each of these functions, class information is
separated and propagated. We conclude by discussing the broader implications of
these findings.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Comparing performance</title>
      <p>
        We compare the performance of DNNs of varying width and varying depth on
two tasks. The two main activation functions of interest are the rectified linear
unit (ReLU) and the sigmoidal unit, as motivated in Section 1. We investigate
fully connected feed-forward neural networks on the MNIST [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and CIFAR10
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] datasets – convolutional networks are more commonly used on these tasks
and achieve higher accuracies, but we limit our attention to fully connected
networks in order to investigate the essential components of generalization in
DNNs. MNIST is a relatively simple dataset that contains 60 000 training images of
handwritten digits ranging from 0 to 9, while CIFAR10 is a more complex dataset
that contains 60 000 images of 10 different objects, including planes, ships, cats
and dogs. For each dataset we split the data into a training, validation and test
set. We trained several networks, as summarized below:
– Network depths of: 2, 4, 6 and 8 layers.
– Network widths of: 200 and 800 nodes.
– With batch-normalization layers and without batch-normalization layers.
      </p>
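      <p>Assuming that every combination of the listed settings is trained for both activation functions (the full grid is an assumption; the text above only lists the individual settings), the configurations for one dataset can be enumerated with the following sketch. The dictionary keys are illustrative names.</p>
      <preformat>
```python
from itertools import product

DEPTHS = (2, 4, 6, 8)              # number of layers, as listed above
WIDTHS = (200, 800)                # nodes per hidden layer
BATCH_NORM = (True, False)         # with and without batch normalization
ACTIVATIONS = ("relu", "sigmoid")  # the two functions under comparison

# Assumed full grid: one entry per combination of the settings above.
configs = [
    {"activation": a, "depth": d, "width": w, "batch_norm": bn}
    for a, d, w, bn in product(ACTIVATIONS, DEPTHS, WIDTHS, BATCH_NORM)
]

print(len(configs), "configurations per dataset")
for cfg in configs[:3]:
    print(cfg)
```
      </preformat>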
      <p>
        We optimize the hyper-parameters that most strongly affect the model
convergence and generalization error, namely the learning rate and training seed.
Values for hyper-parameters such as the scale of initial random weights and
layer biases are selected after performing several initial experiments to
determine the combinations of hyper-parameters that give good convergence with
high validation accuracy. The optimizer used to train the neural networks is
Adam [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], because, unlike plain stochastic gradient descent, it adapts each update using
estimates of lower-order moments of the gradients. In all experiments, we optimize over three
different random seeds when searching for hyper-parameter values to increase
our confidence that these values are acceptable. We use an iterative grid
search to determine appropriate learning rate values, and let the learning rate
decay by a factor of 0.99 after every epoch.
      </p>
      <p>
        We train all models for 300 epochs and use early stopping to regularize
training. No other explicit regularization, such as L1 or L2 norm penalties,
is added to the networks. We do, however, acknowledge the regularizing
effect of batch normalization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Cross-entropy is used as the loss function, with a
softmax layer at the end of the network. We use Xavier [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] initialization for
sigmoid networks and He [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] initialization for ReLU networks. We ensure that
networks are sufficiently trained by tracking the training loss over epochs as well
as the training and validation accuracy. We also compare the performance of our
models to benchmark results for MNIST [
        <xref ref-type="bibr" rid="ref10 ref14">10, 14</xref>
        ] and CIFAR10 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to ensure that
our results are comparable. On the MNIST dataset, similar architectures reach
a test accuracy of 98.4%, while much larger networks than those shown in this
paper reach 56.84% on the CIFAR10 dataset.
      </p>
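      <p>For reference, the two initialization schemes can be sketched as below. The text above does not state whether the uniform or the normal variant of each scheme was used, so the choice of Xavier uniform and He normal here is an assumption, as are the function names and the layer sizes (784 inputs, matching a flattened 28 x 28 MNIST image, and 200 hidden nodes).</p>
      <preformat>
```python
import math
import random
import statistics


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)                   # fixed seed for reproducibility
w_sigmoid = xavier_uniform(784, 200, rng)  # first layer of a sigmoid network
w_relu = he_normal(784, 200, rng)          # first layer of a ReLU network

# Empirical spread of each weight matrix.
for name, w in (("xavier", w_sigmoid), ("he", w_relu)):
    flat = [v for row in w for v in row]
    print(f"{name}: std = {statistics.pstdev(flat):.4f}")
```
      </preformat>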
      <p>Figure 1 shows an example of the average learning and validation curves over
three seeds with standard error; results shown here are for networks trained on
the MNIST dataset. We observe that for different architecture configurations, the
training curves of the ReLU and sigmoid networks are similar, while the validation
curves of the ReLU networks show better performance overall.</p>
      <p>The learning curves from Figure 1 are obtained on the training and validation
sets. The models performing best on the validation set are evaluated on the test
set. Evaluation accuracy on the test set is reported over different experiments
in Figure 2. From this figure we see that the ReLU networks generally perform
better than the sigmoid networks on the test set. However, when batch
normalization is not used and the network is wide enough, the performance of sigmoid and ReLU networks becomes
more similar. An interesting observation is
that when the sigmoid network has a constant width of 200 nodes, the application
of batch normalization results in an increase in performance with an increase in
depth. When not using batch normalization, the generalization performance of
the sigmoid networks decreases with an increase in depth.</p>
      <p>In the case of CIFAR10 from Figure 2(b), the ReLU networks only reliably
generalize better than the sigmoid networks when trained with batch
normalization. When not trained with batch normalization, the ReLU and sigmoid
networks generalize less well with an increase in depth. The sigmoid networks
perform similarly to ReLU networks in all configurations where batch
normalization is not used, except for the 6- and 8-layer networks that are 200 nodes
wide. The sigmoid networks generalize relatively poorly with these two
configurations compared to the other configurations. This poor generalization could
be attributed to the vanishing gradient problem since these two network
architectures struggle to fit the training set. In contrast, we see a general increase in
evaluation accuracy when increasing the number of hidden layers while training
with batch normalization.</p>
      <p>
        In summary, then, we observe that optimal generalization is obtained when
batch normalization is used to train wide ReLU networks; for CIFAR10, network
depth provides a small additional benefit. In contrast to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we cannot ascribe
these benefits to the vanishing gradients problem, since all our networks, apart
from the 6- and 8-layer sigmoid networks from Figure 2(b), can train to virtually
perfect classification of the training set. An alternative explanation is therefore
required, and the next section investigates a number of clues related to such an
explanation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Node distributions</title>
      <p>To better understand the effect of the activation functions on training and
generalization performance, we investigate the behavior of nodes after the activation
function is applied. We specifically look at the activation values for each class
at each node to investigate how class information is separated and propagated
throughout the network. The term “activation distribution” is used to refer to
the distribution of activation values for the samples of a class at a hidden node.
We choose to show results only for the 4-layer networks as they have moderate
depth and similar trends are observed in deeper and wider networks.
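At each hidden node
<inline-formula><tex-math>h_{i,j}</tex-math></inline-formula>
we calculate the class activation distributions
<inline-formula><tex-math>(a_{c,i,j})_{x_{c,1}}^{x_{c,M}}</tex-math></inline-formula>
for each class c after applying the activation function T so that for a single
sample-node pair we get:
<disp-formula>
<label>(1)</label>
<tex-math>a_{c_m,i,j} = T\left(\sum_{k=0}^{s(i-1)} w_{i,j,k}\, h_{i-1,k}\right), \qquad 1 \le i \le N</tex-math>
</disp-formula>
with sample
<inline-formula><tex-math>x_{c,m}</tex-math></inline-formula>
as input, and the distribution as:
<disp-formula>
<label>(2)</label>
<tex-math>(a_{c,i,j})_{x_{c,1}}^{x_{c,M}} = \{a_{c_1,i,j},\, a_{c_2,i,j},\, \ldots,\, a_{c_m,i,j},\, \ldots,\, a_{c_M,i,j}\}, \qquad 1 \le i \le N, \quad 1 \le c \le C</tex-math>
</disp-formula>
where c is the class index, i the layer index, j the node index and s the number
of nodes in a layer.
<inline-formula><tex-math>x_{c,1}</tex-math></inline-formula>
is the first sample in a class c and
<inline-formula><tex-math>x_{c,M}</tex-math></inline-formula>
the last sample, with C the number of classes. For each class activation distribution
<inline-formula><tex-math>(a_{c,i,j})_{x_{c,1}}^{x_{c,M}}</tex-math></inline-formula>
at each node we calculate the median and standard deviation of the distribution.</p>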
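      <p>The calculation described above can be sketched for a single hidden layer as follows. This is a minimal illustration rather than the code used in this study: the weights and samples are hypothetical toy values, no bias term is shown, and the use of the population standard deviation is an assumption.</p>
      <preformat>
```python
import math
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, inputs, activation):
    """Apply a[j] = T(sum_k w[j][k] * h[k]) for every node j in the layer."""
    return [activation(sum(w * h for w, h in zip(row, inputs)))
            for row in weights]


def class_activation_stats(weights, samples_by_class, activation):
    """Return {class: [(median, std) for each node]} for one hidden layer."""
    stats = {}
    for c, samples in samples_by_class.items():
        # activations[m][j]: activation of node j for the m-th sample of class c
        activations = [layer_forward(weights, x, activation) for x in samples]
        per_node = list(zip(*activations))  # transpose to node-major order
        stats[c] = [(statistics.median(a), statistics.pstdev(a))
                    for a in per_node]
    return stats


# Toy data (hypothetical): 2 input features, 2 hidden nodes, 2 classes.
weights = [[4.0, -4.0], [-4.0, 4.0]]
samples_by_class = {
    0: [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],
    1: [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
}

stats = class_activation_stats(weights, samples_by_class, sigmoid)
for c, node_stats in stats.items():
    for j, (med, std) in enumerate(node_stats):
        print(f"class {c}, node {j}: median={med:.3f}, std={std:.3f}")
```
      </preformat>
      <p>In this toy example each node responds strongly to one class and weakly to the other, so the class medians at a node lie near opposite ends of the sigmoid range.</p>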
      <p>When training a DNN with non-linear activation functions, there exist
points where the learning process saturates and gradients reach zero or
near-zero values, depending on the activation function. For ReLU this saturation
point is at 0.0: inputs to the activation function at or below this point have zero gradient, and
learning subsides for that specific sample at that node. The sigmoid function
has two such saturation points, at output values of 0.0 and 1.0 respectively. For the sigmoid
function, gradients never reach zero when nearing these saturation points, but
they become very small, so any weight update with this gradient has almost
no effect. The position of the median relative to the saturation points, and
relative to the medians of the activation distributions of other classes at the same node,
indicates how the class information is separated among
different classes. The standard deviation indicates the variability of the
activation values and the extent to which the majority of activation values
approach the saturation points.</p>
      <p>Figure 4 shows the typical output of specific hidden nodes in the trained
network after the ReLU activation function has been applied at the node. It is
clear that the activation distributions are cut off at 0.0 since activation values
below 0.0 are rectified while the positive values retain their magnitude. The
distributions in the positive domain retain some of the same uniform shape,
depending on how many samples have negative activation values. In the positive
domain, the activation distributions are not saturated and do not overlap as
heavily across classes as the node outputs in Figure 3. Not only
do the nodes in deeper layers become more specialized towards one class, they
also activate for fewer samples than the nodes in earlier layers. This confirms the
ability of ReLU-trained networks to be sparse.</p>
      <p>To understand how class information is separated and propagated, we plot
the median of the activation distributions of each class at each node. Figures
5 through 7 show the median values for the distributions after the activation
function has been applied. The x-axis indexes the nodes in the networks. For
a 4x200 network, this means that nodes 1-200 make up layer one, nodes
201-400 layer two, etc. Since the activation distributions of the sigmoid network are
highly non-uniform, we track the position of activation distributions by plotting
the median of the distributions. By using the median values, we regard the
distribution as a whole and not only the location of the distribution mass.</p>
      <p>For the untrained model, Figure 5 shows that the medians are mostly
clustered around 0.5 for the first hidden layer. The medians in the second hidden
layer (nodes 200-400) are slightly more saturated towards 1.0 and 0.0 and the
median values are starting to overlap more for classes at the same node. From
nodes 400 to 600 it is observed that the medians of the distributions overlap
significantly, meaning that with the untrained weights the nodes are not able
to distinguish activation values of one class from the others. In the last hidden
layer (nodes 600-800) the activation distributions lie almost
directly on top of each other and class information cannot be separated. From
Figure 5 we see that when the model is sufficiently trained, class information is
separated by saturating medians towards 0.0 and 1.0. This effect is clearer at
deeper hidden layers (nodes 400-599 and 600-800), while more nodes in earlier
layers have activation distributions with median values slightly closer to 0.5.</p>
      <p>For ReLU networks the mean and the median are almost identical due to the
uniform shape of their activation distributions. Figure 6 shows the untrained
model for the network with ReLU activation functions. The median
values of node distributions are less saturated in earlier layers and saturate more
towards 0.0 for deeper layers. It is important to note that the activation
distributions in deeper layers do not overlap as strongly compared to the sigmoid
network in Figure 5. The ability of the ReLU network to completely suppress
class information and not activate for specific samples allows the network to
still separate distributions of different classes in deeper layers. Even when
distributions are not separated in a meaningful/learned way, the inherent “sparse”
structure that the ReLU activations introduce suggests better separation of class
information.</p>
      <p>From Figure 6 it is observed that when a ReLU network is sufficiently well
trained, the activation distributions in earlier layers have lower activation values
and class information is suppressed with overlap of activation distributions of
classes. This observation, together with the behavior seen in Figure 4, indicates that nodes
in earlier layers remain relatively agnostic towards classes. The nodes in deeper
layers have less overlap and nodes become more specialized towards specific
classes.</p>
      <p>Figure 7 shows the median values of activation distributions of networks
trained on the CIFAR10 dataset. If the network cannot effectively fit the more
complex data, class information cannot be effectively separated and the median
values look more similar to the untrained models in Figure 5 and Figure 6.</p>
      <p>
        Since the beneficial effects of sparsity in deep rectifier models have been
proposed as a mechanism for their performance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we calculate the sparsity of
one of our models trained with and without batch normalization. In Table 1 we
compare the sparsity of each layer for these two networks. The metric is
determined by counting the number of samples that are inactive for each node and
then averaging over nodes to get the sparsity metric for the layer. Networks that
are trained with batch normalization have fewer sparse interactions per layer,
meaning that nodes are (on average) active for more samples of a class. We know
however from Figure 2(a) and 2(b) that the use of batch normalization causes
the generalization performance of the ReLU networks to increase. This leads us
to believe that sparsity, although useful as shown by Glorot et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], does not
entirely describe the impressive generalization abilities of ReLU networks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this study we compared the performance of different neural network
architectures trained with ReLU and sigmoid activation functions. We observed that
when networks are wide enough, sigmoid networks have comparable
generalization performance to ReLU networks. The ReLU networks benefit more from the
use of batch normalization than networks trained with sigmoid activations.</p>
      <p>We investigated and compared the behavior of nodes trained with these two
activation functions by looking at how class information is separated. It was
observed that while ReLU and sigmoid networks both are able to effectively
separate class information, they do this in fundamentally different ways. The
sigmoid networks saturate class information towards 0.0 and 1.0 where there is
plenty of overlap between activation distributions. The ReLU networks saturate
class information towards 0.0 for earlier layers with moderate overlap of class
distributions, making earlier layers more conservative towards class discrimination
while nodes in later layers become more specialized towards single classes. We
also show that when training a ReLU network with batch normalization, the
hidden layers have lower average sparsity but superior generalization performance
compared to ReLU networks trained without batch normalization.</p>
      <p>Overall we find that sparsity and saturation seem less pertinent than the way
in which class-distinctive information is propagated through the networks, and
how node behavior differs. We plan to relate the node behavior of DNNs trained
with different activation functions to their generalization performance in future
studies.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the National Research Foundation (NRF,
Grant Number 109243). Any opinions, findings, conclusions or recommendations
expressed in this material are those of the authors and the NRF does not accept
any liability in this regard.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Glorot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep sparse rectifier neural networks</article-title>
          .
          <source>In: Proceedings of the fourteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          . PMLR, Fort Lauderdale, FL, USA (11-13 Apr
          <year>2011</year>
          ), http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Glorot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>In: Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          . PMLR, Chia Laguna Resort, Sardinia, Italy (13-15 May
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          ), http://www.deeplearningbook.org
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification</article-title>
          .
          <source>CoRR abs/1502.01852</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1502.01852
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>CoRR abs/1502.03167</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1502.03167
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jarrett</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>What is the best multi-stage architecture for object recognition?</article-title>
          <source>ICCV'09</source>
          pp. 16, 24, 27, 173, 192, 226, 363, 364, 525 (
          <year>2009</year>
          ). https://doi.org/10.1109/ICCV.2009.5459469, http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>International Conference on Learning Representations</source>
          (12
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>CIFAR10 dataset</article-title>
          , https://www.cs.toronto.edu/~kriz/cifar.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>MNIST database of handwritten digits</article-title>
          , http://yann.lecun.com/exdb/mnist/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <issue>11</issue>
          ),
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          :
          <article-title>How far can we go without convolution: Improving fully-connected networks</article-title>
          .
          <source>CoRR abs/1511.02580</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1511.02580
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Maas</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hannun</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Rectifier nonlinearities improve neural network acoustic models</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Machine Learning</source>
          . pp.
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          . ICML'10 (
          <year>2010</year>
          ), http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.6419
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Simard</surname>
            ,
            <given-names>P.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinkraus</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Best practices for convolutional neural networks applied to visual document analysis</article-title>
          .
          <source>In: International Conference on Document Analysis and Recognition (ICDAR)</source>
          . vol.
          <volume>02</volume>
          , p.
          <fpage>958</fpage>
          (08
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>