<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Activation gap generators in neural networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Multilingual Speech Technologies, North-West University</institution>
          ,
          <addr-line>South Africa; and CAIR</addr-line>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>No framework exists that can explain and predict the generalisation ability of DNNs in general circumstances. In fact, this question has not been addressed for some of the least complicated of neural network architectures: fully-connected feedforward networks with ReLU activations and a limited number of hidden layers. Building on recent work [2] that demonstrates the ability of individual nodes in a hidden layer to draw class-specific activation distributions apart, we show how a simplified network architecture can be analysed in terms of these activation distributions, and more specifically, the sample distances or activation gaps each node produces. We provide a theoretical perspective on the utility of viewing nodes as activation gap generators, and define the gap conditions that are guaranteed to result in perfect classification of a set of samples. We support these conclusions with empirical results.</p>
      </abstract>
      <kwd-group>
        <kwd>Generalisation</kwd>
        <kwd>fully-connected feedforward networks</kwd>
        <kwd>activation distributions</kwd>
        <kwd>MLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Deep Neural Networks (DNNs) have been used to achieve excellent
performance on many traditionally difficult machine learning tasks, especially
high-dimensional classification tasks such as computer vision, speech recognition and
machine translation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DNNs generalise well: trained on a limited data set, they
are able to transfer this learning to unseen inputs in a demonstrably effective
way. Despite various approaches to studying this process [
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref14 ref3 ref6 ref7 ref8">1, 3, 6–8, 11–14</xref>
        ], no
framework yet exists that can explain and predict this generalisation ability of
DNNs in general circumstances.
      </p>
      <p>
        Specifically, one of the central tenets of statistical learning theory links model
capacity (the complexity of the hypothesis space the model represents) with
expected generalisation performance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, a sufficiently large DNN
represents an extremely large hypothesis space, specified by hundreds of thousands of
trainable parameters. By this reasoning, any architecture whose hypothesis space is
large enough to memorise random noise should not be expected to
generalise well; yet DNNs do generalise. In a paper that caused much controversy,
Zhang et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] demonstrated how both Convolutional Neural Networks (CNNs)
and standard Multilayer Perceptrons (MLPs) are able to memorise noise
perfectly, while extracting the signal buried within the noise with the same efficiency
as if the noise was not present. Even more pointedly, this was shown to occur
with or without adding regularisation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        In this work we take a step back, and analyse the classification ability of a
simplified neural network architecture, since even for a minimal network (as soon
as it has multiple layers and non-linear activation functions) generalisation
behaviour has not been fully characterised. In recent work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we showed how the
individual nodes of a standard fully-connected feedforward network can draw
class-specific activation distributions apart, to the extent that these
distributions can be used to train individual likelihood estimators and produce accurate
classifications at each layer of the network. Here we show that the ability of a
fully-connected feedforward network to generalise can be analysed in terms of
these activation distributions, and more specifically the distances or ‘activation
gaps’ each node produces: the difference in activation value of any two samples
at a node.
      </p>
      <p>The main contribution of this paper is the development of a conceptual tool
that can be used to probe the generalisation ability of a DNN, and the
empirical confirmation of the soundness of the approach when applied to a simplified
MLP architecture. We start by reviewing the concept of node-specific sample
sets and nodes as likelihood estimators (Section 2) before introducing activation
gaps (Section 3) and exploring the role of activation gaps in achieving perfect
classification from a theoretical perspective. Expected relationships are
empirically confirmed in Section 4, and possible applications of these ideas as well as
future work briefly discussed in Section 5.
</p>
    </sec>
    <sec id="sec-2">
      <title>Nodes as network elements</title>
      <p>
        While a network functions as a single construct, each layer also acts as a
functional element, and within a layer, each node has both a local and global function.
Nodes are locally attuned to extracting information from a very specific part of
the input space, while collaborating globally to solve the overall classification
or regression task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this section we review the local nature of nodes: their
role as individual likelihood estimators, their relationship with the samples that
activate them (their ‘sample sets’), and the node-specific interaction between
sample sets and the weights that feed into a specific node.
      </p>
      <p>From this point onwards, we restrict our discussion to fully-connected
feedforward architectures with ReLU activations and unrestricted breadth (number
of nodes in a hidden layer) and depth (number of hidden layers), applied to a
classification task.
</p>
      <sec id="sec-2-1">
        <title>Nodes as estimators</title>
        <p>
          The network behaviour for an architecture as described above was analysed
in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Generalisation ability was interpreted as arising from two collaborative
information processing systems, one continuous and one discrete. Specifically, it
was shown how the network both represents a discrete system that only utilises
information with regard to whether a node is active or not, and a continuous
system that utilises the actual pre-activation value at each node. In both systems,
each node implicitly represents the likelihood of an observation of each class: in
the discrete case, this likelihood can be estimated by counting the number of
times a node activates for a given class; in the continuous case, by fitting a
density estimator to the pre-activation distribution of each class. In both cases,
the posterior probability (given the observation) can be calculated using Bayes
rule, and the probabilities multiplied across all nodes of any layer to produce a
layer-specific class prediction for each sample.
        </p>
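        <p>As a minimal illustration of the discrete system (a sketch only, not the code of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]; the array names, shapes and the naive-Bayes style combination of a single prior term with per-node likelihoods are assumptions of the sketch), per-node likelihoods can be estimated by counting how often each node activates per class, and a layer-specific prediction obtained by combining these estimates across all nodes of the layer:</p>
        <preformat>
import numpy as np

# Minimal sketch of the *discrete* per-node estimators: Z holds the
# pre-activations of one hidden layer, shape (n_samples, n_nodes), and y holds
# integer class labels. (The continuous system would instead fit a density
# estimator to the per-class pre-activation values at each node.)

def discrete_log_likelihoods(Z, y, n_classes, eps=1e-12):
    """Per node: log P(node active | class) and log P(node inactive | class)."""
    active = Z &gt; 0
    ll = np.zeros((Z.shape[1], n_classes, 2))
    for c in range(n_classes):
        p_active = active[y == c].mean(axis=0)     # activation frequency per node
        ll[:, c, 1] = np.log(p_active + eps)
        ll[:, c, 0] = np.log(1.0 - p_active + eps)
    return ll

def classify_layer(Z, ll, priors):
    """Combine per-node estimates across the layer (sum of log-likelihoods + log prior)."""
    active = (Z &gt; 0).astype(int)                   # (n_samples, n_nodes)
    n_samples, n_nodes = active.shape
    scores = np.tile(np.log(priors), (n_samples, 1))
    for j in range(n_nodes):
        scores += ll[j][:, active[:, j]].T         # add log P(state of node j | class)
    return scores.argmax(axis=1)

# Usage, assuming Z_train/Z_test were recorded at some layer:
#   priors = np.bincount(y_train) / len(y_train)
#   ll = discrete_log_likelihoods(Z_train, y_train, n_classes=10)
#   layer_predictions = classify_layer(Z_test, ll, priors)
        </preformat>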
        <p>
          An example of this is depicted in Figure 1, where an MLP trained and tested
on FMNIST [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is analysed at each layer, in the manner described above. During
classification, each discrete system only uses information with regard to whether
a node is activated or not; each continuous system only uses pre-activation
values; and the combined system uses the true information available to the
ReLU-activated network (either the continuous or discrete estimate). For this network,
as for all other networks of sufficient size analysed, it was observed that at some
layer these estimators achieve classification accuracy similar to that of the actual
network, irrespective of the system used to perform classification [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>This is quite surprising. In both systems, each node therefore uses locally
available information to develop its own probability estimate in isolation, but
then collaborates with other nodes to solve the classification task. Using this
perspective, the set of samples that activates any node (its sample set) becomes
very significant. During gradient descent, the forward process applies weights
to create sample sets; the backward process uses sample sets to update weights:
each weight attuned to only its specific sample set. Sample sets can range from very large
(consisting of almost all samples) to very specialised, describing a few isolated
samples.
</p>
      </sec>
      <sec id="sec-2-2">
        <title>Node-supported cost</title>
        <p>
          The interplay between sample sets and weight updates was further investigated
in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]: rewriting the weight update Δw^s_{i,j,k} associated with a single sample s at
layer i, from node k to node j, in an iterative form produces a surprisingly simple
structure. Specifically, note that

ReLU(x) = x T(x)    (1)

where

T(x) = 1 if x &gt; 0; 0 if x ≤ 0    (2)

With η the (potentially adaptive) learning rate, a_{i−1,k} the activation result
at layer i − 1 for node k, λ^s_m the cost function at output node m, z_{i,j} the sum
of the input to node j in layer i, and I(i, j) an indexing function (see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]), the
weight update associated with a single sample can be written as:

Δw^s_{i,j,k} = −η a^s_{i−1,k} Σ_{b=0}^{B_i−1} λ^s_{I(N,b)} Π_{g=i}^{N−1} T(z^s_{g,I(g,b)}) Π_{r=i+1}^{N} w_{r,I(r,b),I(r−1,b)}    (3)

with B_i a product of the number of nodes in all layers following after layer
i. If we now define the sample-specific node-supported cost at layer i, node j as

φ^s_{i,j} = Σ_{b=0}^{B_i−1} λ^s_{I(N,b)} Π_{g=i}^{N−1} T(z^s_{g,I(g,b)}) Π_{r=i+1}^{N} w_{r,I(r,b),I(r−1,b)}    (4)

then the weight update by a single sample can be written as

Δw^s_{i,j,k} = −η a^s_{i−1,k} φ^s_{i,j}    (5)

and over all samples (in the mini-batch) used when computing the update:

Δw_{i,j,k} = −η Σ_{s∈{.}_{i,j}} a^s_{i−1,k} φ^s_{i,j}    (6)

where {.}_{i,j} denotes the sample set of node j in layer i. This sum can either be
calculated over {.}_{i,j}, or over {.}_{i,j} ∩ {.}_{i−1,k}, as only samples that are active
at node k will contribute to the sum either way.
        </p>
        <p>The node-supported cost φ^s_{i,j} is a scalar value that represents the portion of the
final cost that can be attributed to all active paths initiated from this node,
when processing sample s. Note that φ^s_{i,j} does not differentiate between a node
that creates paths with large positive and negative costs that balance out, and
one that produces a cost close to zero. Also, a positive cost implies too much
activation; a negative cost, too little.</p>
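        <p>The factorisation of Equation 5 can be made concrete with a small numerical sketch.
The example below is an illustration only: it assumes that φ^s_{i,j} can be obtained as the
backpropagated derivative of the per-sample loss with respect to the pre-activation z_{i,j},
and it uses a squared error with a linear output layer for simplicity (the networks in this
paper use a cross-entropy loss):</p>
        <preformat>
import numpy as np

# Toy ReLU network: layer sizes are arbitrary choices for the sketch.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                      # input, two hidden layers, output
W = [rng.normal(size=(m, n)) for m, n in zip(sizes[1:], sizes[:-1])]

def forward(x):
    a, acts, zs = x, [x], []
    for i, w in enumerate(W):
        z = w @ a
        zs.append(z)
        a = np.maximum(z, 0.0) if i + 1 &lt; len(W) else z   # ReLU, linear output
        acts.append(a)
    return acts, zs

x = rng.normal(size=sizes[0])
target = np.array([1.0, 0.0])
acts, zs = forward(x)

# phi[i][j]: the node-supported cost of node j in weight layer i for this sample,
# taken here to be d(loss)/d(z[i][j]) for loss = 0.5 * ||output - target||^2.
phi = [None] * len(W)
phi[-1] = acts[-1] - target
for i in range(len(W) - 2, -1, -1):
    phi[i] = (W[i + 1].T @ phi[i + 1]) * (zs[i] &gt; 0)      # T(z) gates inactive paths

# Single-sample weight updates, factorised as in Equation 5:
eta = 0.1
updates = [-eta * np.outer(phi[i], acts[i]) for i in range(len(W))]
# updates[i][j, k] = -eta * (activation of node k in the previous layer) * phi of node j
        </preformat>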
      </sec>
      <sec id="sec-2-3">
        <title>Weights as directions</title>
        <p>From the above discussion, each node vector is updated in response to the errors
remaining in its sample set. For some samples, activation values would be too
high, for others too low. Per node, the process of updating the node vector can
be viewed as one of finding a direction in its input space (the output of the
previous layer), such that samples producing different errors are separated when
calculating the projection of the sample onto the weight.</p>
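        <p>As a toy illustration of this view (all values below are invented for the sketch), the
activation a sample produces at a node is, up to the weight vector's norm and the bias term,
the projection of the sample onto the node's direction; a direction is 'important' to the
extent that it pulls groups of samples apart:</p>
        <preformat>
import numpy as np

# Illustration only: treat a node's in-fan weight vector as a direction in its
# input space, and look at the projections of two groups of samples onto it.
rng = np.random.default_rng(2)
X_a = rng.normal(loc=+1.0, size=(100, 20))    # samples with one error profile
X_b = rng.normal(loc=-1.0, size=(100, 20))    # samples with another

w = X_a.mean(axis=0) - X_b.mean(axis=0)       # one candidate direction; in a trained
w /= np.linalg.norm(w)                        # network gradient descent finds its own

proj_a, proj_b = X_a @ w, X_b @ w             # unit-norm w: activations are projections
print(proj_a.mean() - proj_b.mean())          # how far apart the two groups are drawn
        </preformat>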
        <p>With sample sets in mind, training can then be viewed as a process of
finding important directions in the layer-specific input space, projecting the original
features in each of these directions to create a transformed input space, and
repeating the process layer by layer. Important directions are those useful for
classification: class-distinctive in the initial layers, class-specific in the later
layers. It is important to note that this optimisation process is both local and
global: the direction of this node vector is optimised specifically for the sample
set concerned, and only the sample set concerned, resulting in a local process,
but the errors used to guide the process (the node-supported cost) are calculated
globally.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Gaps and ratios</title>
      <p>In order to develop a theoretical perspective on the interplay between the weight
vectors, node-supported cost and sample sets, we constrain our initial analysis to
a highly restricted architecture. This allows us to determine the exact theoretical
conditions for perfect classification of a set of samples.
</p>
      <sec id="sec-3-1">
        <title>A simplified architecture</title>
        <p>While an MLP is typically viewed as an input and output layer flanking any
number of hidden layers, each hidden layer can also be seen as a small 3-layer
subnetwork in its own right: utilising the output from the prior layer as input,
and trying to address the loss (the node-supported cost of Section 2.2) passed
along from its immediate output layer, the next hidden layer. As a starting
point for our analysis, we therefore restrict ourselves to the setup of such a
3layer subnetwork: considering only a single functional hidden layer in addition
to an input and output layer. The term ‘functional’ is used to emphasise that
only in this single hidden layer are nodes trained (in the standard manner) to
act as generalising elements.</p>
        <p>An additional hidden layer is added between the functional hidden layer and
the output as a summarising element: this layer contains only two nodes, in
order to summarise the activations produced by the functional hidden layer for
analysis purposes; as illustrated in Figure 2.
</p>
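        <p>One possible instantiation of this architecture, written as a PyTorch module (the layer
sizes are illustrative assumptions: 784 MNIST inputs, a 300-node functional hidden layer and
10 output classes):</p>
        <preformat>
import torch.nn as nn

# One possible instantiation of the simplified architecture (sizes are
# illustrative: 784 MNIST inputs, a 300-node functional hidden layer,
# the 2-node summary layer, 10 output classes).
model = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),   # single functional hidden layer
    nn.Linear(300, 2), nn.ReLU(),     # 2-node summarising layer (last hidden layer)
    nn.Linear(2, 10),                 # output layer, one node per class
)
        </preformat>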
      </sec>
      <sec id="sec-3-2">
        <title>Theoretical requirements for perfect classification</title>
        <p>Consider a network that only has two nodes in its last hidden layer. At least
two nodes are required to be able to differentiate among classes, but two nodes
are sufficient to differentiate among an unlimited number of classes. Consider
two samples c_s and c′_t from two different classes c and c′, respectively. Limit the
nodes in the last hidden layer to j and k, let a_{jc_s} be the activation value of the
sample c_s at the jth node, and w_{jc} the weight from node j to the output node
associated with class c.</p>
        <p>
          We can now define two useful values: the activation gap α and the weight gap
φ. The activation gap is defined as the difference between activation values at a
node:

α_{jcc′} = a_{jc_s} − a_{jc′_t}    (7)
α_{kcc′} = a_{kc_s} − a_{kc′_t}    (8)

We therefore use α_{jcc′} as a shorthand for α_{jc_s c′_t}, remembering that all α values
are specific to a sample pair, and are not a single value per class. The weight gap
φ is defined as the difference between weight values anchored at the same node,
and terminating in the output layer:

φ_{jcc′} = w_{jc} − w_{jc′}    (9)
φ_{kcc′} = w_{kc} − w_{kc′}    (10)

The weight gaps are not sample-specific, and truly have a single value for each
node and pair of classes. These definitions are illustrated in Figure 3. For any
two samples from any two different classes c and c′ to be classified correctly, the
following must hold (writing a_{jc} for a_{jc_s} and a_{jc′} for a_{jc′_t}):

a_{jc}w_{jc} + a_{kc}w_{kc} &gt; a_{jc}w_{jc′} + a_{kc}w_{kc′}    (11)
a_{jc′}w_{jc′} + a_{kc′}w_{kc′} &gt; a_{jc′}w_{jc} + a_{kc′}w_{kc}    (12)
        </p>
        <p>
          Rewriting Equation 12 as −a_{jc′}w_{jc} − a_{kc′}w_{kc} &gt; −a_{jc′}w_{jc′} − a_{kc′}w_{kc′} and adding it
to Equation 11 gives

a_{jc}w_{jc} + a_{kc}w_{kc} − a_{jc′}w_{jc} − a_{kc′}w_{kc} &gt; a_{jc}w_{jc′} + a_{kc}w_{kc′} − a_{jc′}w_{jc′} − a_{kc′}w_{kc′}    (13)
⇒ α_{jcc′}w_{jc} + α_{kcc′}w_{kc} &gt; α_{jcc′}w_{jc′} + α_{kcc′}w_{kc′}    (14)
⇒ α_{jcc′}w_{jc} − α_{jcc′}w_{jc′} &gt; α_{kcc′}w_{kc′} − α_{kcc′}w_{kc}
⇒ α_{jcc′}φ_{jcc′} &gt; α_{kcc′}φ_{kc′c}    (15)

This does not mean that there is any asymmetry in the roles of j and k: the
inequality can be rewritten in different ways, as the following are all equivalent:

α_{jcc′}φ_{jcc′} &gt; α_{kcc′}φ_{kc′c}
α_{jcc′}φ_{jcc′} &gt; −α_{kcc′}φ_{kcc′}
α_{kcc′}φ_{kcc′} &gt; α_{jcc′}φ_{jc′c}
        </p>
        <p>
          Using the definition of the weight gaps (Equations 9 and 10) directly in
Equations 11 and 12, it also follows that

a_{jc}φ_{jcc′} + a_{kc}φ_{kcc′} &gt; 0    (16)
a_{jc′}φ_{jcc′} + a_{kc′}φ_{kcc′} &lt; 0    (17)

which means that, since the activation values are ≥ 0, one of φ_{jcc′} or φ_{kcc′} must
always be negative, and the other positive; equivalently, φ_{jcc′} and φ_{kc′c} must have
the same sign (since φ_{kcc′} = −φ_{kc′c}). It will therefore always hold for correctly
classified samples that

φ_{jcc′} φ_{kc′c} &gt; 0    (18)
        </p>
        <p>
          Following the reverse process (not shown here), Equations 15 and 18 then
become requirements for correct classification. The requirement becomes clearer if
Equation 15 is restated as a ratio. Specifically, we define the activation ratio as

α-ratio_{jkcc′} = α_{jcc′} / α_{kcc′}    (19)

and the weight ratio as

φ-ratio_{jkcc′} = φ_{kc′c} / φ_{jcc′}    (20)

Taking signs into account, correct classification then requires that, in addition
to Equation 18, either

α_{jcc′} α_{kcc′} &lt; 0    (21)

or

α_{jcc′} α_{kcc′} &gt; 0    (22)

with, in the latter case,

α-ratio_{jkcc′} &gt; φ-ratio_{jkcc′} if φ_{jcc′} α_{kcc′} &gt; 0
α-ratio_{jkcc′} &lt; φ-ratio_{jkcc′} if φ_{jcc′} α_{kcc′} &lt; 0    (23)

which is required to hold for all samples of all classes classified correctly, for every
single c and c′ pair. Note that the weight ratio is fixed for all samples, once the
network has been trained, and is determined solely by the weight values in the
output layer. The role of the nodes up to the last hidden layer is then to create
activation gaps between samples, consistent with the established weight ratios.
Consistency either requires a specific sign assignment, or in other cases, a very
specific ratio relationship. The weight ratios will always be positive, and since
the activation ratios under the condition of Equation 22 are also always positive,
it is the absolute values of these ratios that matter. Since each gap is created
simply by summing over all active nodes in the previous layer, nodes that are
able to separate classes well are re-used, and their ability to separate classes can
be analysed by analysing the α values.
        </p>
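        <p>These conditions are easy to check numerically. The sketch below (an illustration in the
notation of this section, not code from the paper) draws random non-negative activations and
output weights, and confirms that the classification conditions of Equations 11 and 12 hold
exactly when the weight-gap forms of Equations 16 and 17 hold, and that the sign condition of
Equation 18 then follows:</p>
        <preformat>
import numpy as np

# Sketch only: a_c and a_cp hold the (node j, node k) activations of the two
# samples (non-negative, as produced by ReLU nodes); w_c and w_cp hold the
# weights from nodes j and k to the output nodes of classes c and c'.
rng = np.random.default_rng(1)

def correctly_classified(a_c, a_cp, w_c, w_cp):
    # Equations 11 and 12: each sample's own class wins against the other class.
    return bool(a_c @ w_c &gt; a_c @ w_cp) and bool(a_cp @ w_cp &gt; a_cp @ w_c)

def gap_conditions(a_c, a_cp, w_c, w_cp):
    # Equations 16 and 17, written in terms of the weight gaps phi = w_c - w_cp.
    phi = w_c - w_cp
    return bool(a_c @ phi &gt; 0) and bool(a_cp @ phi &lt; 0)

for _ in range(10000):
    a_c = rng.uniform(0.0, 1.0, size=2)
    a_cp = rng.uniform(0.0, 1.0, size=2)
    w_c = rng.normal(size=2)
    w_cp = rng.normal(size=2)
    assert correctly_classified(a_c, a_cp, w_c, w_cp) == gap_conditions(a_c, a_cp, w_c, w_cp)
    if correctly_classified(a_c, a_cp, w_c, w_cp):
        phi = w_c - w_cp
        # Equation 18: the weight gaps at nodes j and k have opposite signs.
        assert phi[0] * phi[1] &lt; 0
        </preformat>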
      </sec>
      <sec id="sec-3-3">
        <title>Network normalisation</title>
        <p>
          After training and prior to any analysis, the network is normalised to remove
potential cross-layer artefacts. Figure 4 demonstrates how it is possible to
introduce artefacts that would invalidate any layer-specific analysis, without changing
overall network behaviour. We therefore perform weight normalisation one layer
at a time: normalising the in-fan weight vector per node, and passing this norm
along to the out-fan weight vector of the same node at the next layer.
Specifically, we calculate the node-specific norm of the in-fan weight vector at that
node, and use this value to both divide the in-fan weight vector and multiply
the out-fan weight vector. This has the added benefit that all in-fan weight vectors
now have a norm of 1, which means that the activation values at any node are
simply the projections of the sample values onto the weight vector.
        </p>
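        <p>A minimal sketch of this per-layer normalisation for a stack of fully-connected ReLU
layers (treating the biases, which the text does not discuss, by scaling them together with
the in-fan weights, and leaving the output layer to absorb the final norms):</p>
        <preformat>
import numpy as np

# Sketch of the layer-wise normalisation described above, for a stack of
# fully-connected ReLU layers.  W[i] has shape (n_out, n_in); b[i] holds the
# biases.  Scaling the biases alongside the in-fan weights is an assumption of
# this sketch (the text only discusses the weight vectors).

def normalise(W, b):
    W, b = [w.copy() for w in W], [v.copy() for v in b]
    for i in range(len(W) - 1):                    # hidden layers only
        norms = np.linalg.norm(W[i], axis=1)       # one norm per node (in-fan vector)
        W[i] /= norms[:, None]                     # in-fan vectors now have norm 1
        b[i] /= norms
        W[i + 1] *= norms[None, :]                 # pass the norm to the out-fan weights
    return W, b

# Because ReLU(c * z) = c * ReLU(z) for c &gt; 0, the rescaled network computes
# exactly the same function as before, but activations at each hidden node are
# now projections of the layer's input onto a unit-norm weight vector.
        </preformat>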
        <p>
          In order to demonstrate the concept of activation gaps, we train ReLU-activated
networks with the architecture of Figure 2 using the MNIST [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] dataset. We
use a fairly standard training setup: initialising weights and biases with He
initialisation [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]; optimising the cross-entropy loss with Adam as optimiser [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ];
and performing a grid search over learning rates across different training seeds.
No regularisation apart from early stopping is used. All hyper-parameters are
optimised on a 5,000-sample validation set. We use the same protocol to train
networks with hidden layers of 100, 300 and 600 hidden nodes, and select the
sample networks listed in Table 1 for analysis. Results across the three
architectures are very similar (and per architecture, identical before and after weight
normalisation).
        </p>
        <p>
          We confirm that the expected ratios hold in practice by analysing the weight
and gap ratios for correctly classified samples. We find that these ratios do indeed
hold, as illustrated in Figures 5 to 7. Ratios were confirmed for all samples and
networks, but in these figures we extract the weight and activation gaps for 300
random samples correctly classified by the 300-node model.
        </p>
        <p>
          We build on results from [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which views DNNs as layers of collaborating
classifiers, and probe the interplay between the local and global nature of nodes
in a hidden layer of an MLP. We ask what the theoretical conditions are for
perfect classification of a set of samples from different classes, and answer this
question for a simplified architecture, which adds a summary layer prior to the
final output layer.
        </p>
        <p>While the architecture is simplified, it is not trivial: a summary layer can be
added to a pre-trained MLP, which makes the 2-node summary layer less of a
restriction than it initially seems. Also, MLPs with multiple hidden layers can
be considered as consisting of multiple 3-layer sub-networks stacked on top of
one another; it is not immediately clear to what extent the same factors that are
important for a 3-layer MLP are important for a sub-network within a larger
structure. Recurrence and residual connections are problematic for theoretical
analysis, but convolutional layers can to a large extent be analysed as sparsely
connected, shared-weight MLPs. All in all, while it should be possible to extend the ideas
of this paper to more complex architectures, our first goal is to fully understand
the generalisation ability of straightforward ReLU-activated MLPs.</p>
        <p>For a simplified MLP, we have shown how activation gaps are formed at node
level, and how the consistency of these gaps gives rise to the classification ability
of a layer. Specifically, we show that nodes act as local ‘gap generators’ between
pairs of samples. These gap generators are formed during gradient descent, when
the node-supported cost (Section 2.2) is used to find directions in a layer-specific
input space, which are useful for pulling samples apart. Gaps are then re-used
across multiple samples and their manner of use sheds light on the characteristics
of nodes that we expect to better support generalisation.</p>
        <p>In this work we do not yet use either the gaps or the ratios to probe the
networks themselves: How are gaps distributed across nodes? How are nodes
(and the gaps they create) re-used? From the interaction between sample sets
and gaps, are some nodes more general and others more specific? What does this
say about the generalisation ability of the network? While we have not answered
any of these questions, we have developed a conceptual tool that can be used to
probe networks for answers to questions such as these.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the National Research Foundation (NRF,
Grant Number 109243). Any opinions, findings, conclusions or recommendations
expressed in this material are those of the author and the NRF does not accept
any liability in this regard.</p>
    </sec>
    <sec id="sec-5">
      <title>Ratios for unnormalised networks</title>
      <p>[Figure: ratio plots, panels (a) unnormalised and (b) normalised.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Telgarsky</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Spectrally-normalized margin bounds for neural networks</article-title>
          .
          <source>arXiv preprint (also NeurIPS 30) arXiv:1706.08498</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Davel</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theunissen</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretorius</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnard</surname>
          </string-name>
          , E.:
          <article-title>DNNs as layers of cooperating classifiers</article-title>
          .
          <source>In: AAAI Conference on Artificial Intelligence</source>
          , accepted for publication (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dinh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Sharp minima can generalize for deep nets</article-title>
          .
          <source>arXiv preprint (also ICML 2017) arXiv:1703.04933v2</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          ), http://www.deeplearningbook.org
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Delving deep into rectifiers: Surpassing humanlevel performance on ImageNet classification</article-title>
          .
          <source>arXiv preprint (also ICCV 2015) arXiv:1502.01852</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobahi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Predicting the generalization gap in deep networks with margin distributions</article-title>
          .
          <source>arXiv preprint (also ICLR 2019) arXiv:1810.00113v2</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kawaguchi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pack Kaelbling</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Generalization in deep learning</article-title>
          .
          <source>arXiv preprint arXiv:1710.05468v5</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Keskar</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mudigere</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nocedal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smelyanskiy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>P.T.P.</given-names>
          </string-name>
          :
          <article-title>On largebatch training for deep learning: Generalization gap and sharp minima</article-title>
          .
          <source>arXiv preprint (also ICLR 2017) arXiv:1609.04836</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>International Conference on Learning Representations (ICLR)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <issue>11</issue>
          ),
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Montavon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braun</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          , Müller, K.R.:
          <article-title>Kernel analysis of deep networks</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2563</fpage>
          -
          <lpage>2581</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Neyshabur</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhojanapalli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAllester</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srebro</surname>
          </string-name>
          , N.:
          <article-title>Exploring generalization in deep learning</article-title>
          .
          <source>arXiv preprint (also NeurIPS 30) arXiv:1706.08947</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Raghu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poole</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganguli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          , J.:
          <article-title>On the expressive power of deep neural networks</article-title>
          .
          <source>In: Proceedings of the 34th International Conference on Machine Learning (ICML)</source>
          . vol.
          <volume>70</volume>
          , pp.
          <fpage>2847</fpage>
          -
          <lpage>2854</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Shwartz-Ziv</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tishby</surname>
          </string-name>
          , N.:
          <article-title>Opening the black box of deep neural networks via information</article-title>
          .
          <source>arXiv preprint arXiv:1703.00810</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          :
          <article-title>Statistical Learning Theory</article-title>
          . Wiley-Interscience (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rasul</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
          </string-name>
          , R.:
          <article-title>Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</article-title>
          .
          <source>arXiv preprint arXiv:1708.07747v2</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Recht</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Understanding deep learning requires rethinking generalization</article-title>
          .
          <source>arXiv preprint (also ICLR 2017) arXiv:1611.03530</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>