<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Insights regarding overfitting on noise in deep learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marthinus W. Theunissen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marelie H. Davel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multilingual Speech Technologies, North-West University</institution>
          ,
          <addr-line>South Africa; and CAIR</addr-line>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The understanding of generalization in machine learning is in a state of flux. This is partly due to the relatively recent revelation that deep learning models are able to completely memorize training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about generalization. The phenomenon was brought to light and discussed in a seminal paper by Zhang et al. [24]. We expand upon this work by discussing local attributes of neural network training within the context of a relatively simple and generalizable framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the global deep learning model to generalize in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterized multilayer perceptrons and controlled noise in the training data. The main insights are that deep learning models are optimized for training data modularly, with different regions in the function space dedicated to fitting distinct kinds of sample information. Detrimental overfitting is largely prevented by the fact that different regions in the function space are used for prediction based on the similarity between new input data and that which has been optimized for.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Learning theory</kwd>
        <kwd>Generalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The advantages of deep learning models over some of their antecedents include
their efficient optimization, scalability to high-dimensional data, and
performance on data that was not optimized for [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The latter is arguably the most
important benefit. Machine learning, as a whole, has seen much progress in recent
years, and deep neural networks (DNNs) have become a cornerstone in
numerous important domains such as computer vision, natural language processing,
and bioinformatics. Somewhat ironically, the surge of application potential that
deep learning has unlocked in industry has meant that the development of
theoretically principled guidelines lags behind implementation-specific progress.
      </p>
      <p>A particular example of one such open theoretical question is how to consolidate
the observed ability of DNNs to generalize with classical notions of generalization
in machine learning.</p>
      <p>
        Before deep learning, a generally accepted principle with which to reason
about generalization in machine learning was that it is linked to the complexity
of the hypothesis space, and that the model's representational capacity should
be kept small so as to prevent it from approximating unrealistically complex
functions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Such highly complex functions are not expected to be applicable
to the task being performed, and are usually a result of overfitting the spurious
correlations found in the finite sample of training examples. Many complexity
metrics have been proposed and adapted in an attempt to consolidate this
intuition; however, these metrics have consistently failed to robustly account for the
generalization observed in deep learning models.
      </p>
      <p>
        An in uential paper by Zhang et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] has demonstrated that: (1) DNNs
can efficiently fit various types of noise and still generalize well, and (2)
contemporary explicit regularization is not required to enable good generalization.
These findings are in stark contradiction with complexity-based principles of
generalization: deep learning models are shown to have representational capacities
large enough to approximate extremely complex functions (potentially
memorizing the entire data set) and still have very low out-of-sample test error.
      </p>
      <p>
        In this paper we further investigate the effect of noise in training data with
regard to generalization. Where [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] contributed towards the understanding of
generalization by pointing out misconceptions with regard to model capacity
and regularization, our contribution is to provide insight into a particular
phenomenon that can (at least partially) account for the observed ability of a DNN
to generalize in a very similar experimental framework. We define noise as any
input-output relationship that is not predictable, not conducive to the model
fitting the true training data, or not conducive to approximating the true data
distribution. The following types of noise are investigated (detailed definitions
are provided in appendix B):
1. Label corruption: For every affected sample, its training label is replaced
with an alternative selected uniformly from all other possibilities.
2. Gaussian input corruption: For every affected sample, all its input
features are replaced with Gaussian noise with unchanged mean and standard
deviation.
3. Structured input corruption: For every affected sample, its input
features are replaced by an alternative sample that is completely distinguishable
from the true input but consistent per class. (For example, replacing selected
images from one class with images of a completely different object from a
different data set, but keeping the original class label.)
      </p>
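      <p>To make the three definitions above concrete, the per-sample corruptions can be sketched as follows (a minimal illustration, assuming inputs as flattened float arrays and integer labels; the function names are ours, not taken from any released code):</p>

```python
import numpy as np

def corrupt_labels(y, n_classes, p, rng):
    """Label corruption: with probability p, replace a label with a
    uniformly chosen *alternative* label (the true label is excluded)."""
    y = y.copy()
    for i in range(len(y)):
        if rng.random() < p:
            y[i] = rng.choice([c for c in range(n_classes) if c != y[i]])
    return y

def corrupt_inputs_gaussian(X, p, rng):
    """Gaussian input corruption: with probability p, replace a sample's
    features with Gaussian noise matching that sample's mean and std."""
    X = X.copy()
    for i in range(len(X)):
        if rng.random() < p:
            X[i] = rng.normal(X[i].mean(), X[i].std(), size=X[i].shape)
    return X

def corrupt_inputs_structured(X, p, rng, side=28):
    """Structured input corruption: with probability p, replace a sample
    with a deterministic transform (here: rotate 90 degrees CCW, then
    invert), so the corruption is distinguishable from true data but
    consistent per class."""
    X = X.copy()
    for i in range(len(X)):
        if rng.random() < p:
            img = X[i].reshape(side, side)
            img = np.rot90(img)       # 90 degrees counter-clockwise
            img = img.max() - img     # invert feature values
            X[i] = img.ravel()
    return X
```

Because each sample is corrupted independently, the corruption level can be varied continuously through the single probability parameter.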
      <p>
        These types of noise have been chosen to encompass those investigated in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
place more emphasis on varying the noise, and still maintain the analytically
convenient per-sample implementation of the framework. It is possible to introduce
other types of noise, such as stochastic corruptions in the feature representations
at each layer within the architecture, or noise that directly impedes the
optimization of the model. However, such noise is closely analogous to existing
regularization techniques (dropout, weight decay, node pruning, etc.) and not aligned
with the goal of this paper, namely, to shed light on how a DNN with as few
regularizing factors as possible manages noisy data.
      </p>
      <p>
        The empirical investigation is limited to extremely overparameterized
multilayer perceptron (MLP) architectures with ReLU activations, and two
related classification data sets. These architectures are simple enough to allow for
efficient analysis but function using the same principles as more complex
architectures. The two classification tasks are MNIST [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and FMNIST [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. These
data sets are on the low end of the spectrum with regard to difficulty, with
FMNIST slightly more difficult than MNIST; both are widely used to investigate the
theoretical properties of DNNs [
        <xref ref-type="bibr" rid="ref11 ref20 ref9">9, 11, 20</xref>
        ].
      </p>
      <p>The following section (Section 2) discusses some recent, notable work that
shares the goal of characterizing generalization in deep learning. In Section 3,
a theoretical discussion is provided that defines a view of DNN training that is
useful for interpreting the results that follow. A detailed analysis of the ability of an
MLP to respond to noisy data is provided and discussed in Section 4. Findings
are summarized in the final section with a focus on implications relating to
generalization.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Many attempts at conceptualizing the generalization ability of DNNs focus on
the stability with which a model accurately predicts output values in the
presence of varying input values. One popular approach is to analyse the geometry
of the loss landscape at the optimum [
        <xref ref-type="bibr" rid="ref1 ref12 ref8">8, 1, 12</xref>
        ]. This amounts to investigating
the overall loss value within a region of parameter space close to the optimal
parameterization obtained by the training process. The intuition is that a sharper
minimum or high curvature at the solution point indicates that the model will
be sensitive to small perturbations in the input space. Logically, this will lead
to poor generalization. A practical problem with this approach is that it suffers
heavily from the curse of dimensionality. This means that it is difficult to obtain
an unbiased and consistent perspective on the error surface in high-dimensional
parameter spaces, which is the case in virtually all practical DNNs. The error
surface is typically mapped with dimensionality reductions, random searches,
or heuristic searches [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A conceptual problem is that the loss value is easily
manipulable by weight scaling. For example, Dinh et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have shown that
a minimum can be made arbitrarily sharp/flat with no effect on generalization
by exploiting simple symmetries in the weight scales of networks with rectified
units.
      </p>
      <p>
        Another effort at imposing stability in model predictions is to enforce
sparsity in the parameterization. The hope is that, with a sparsely connected set
of trainable parameters, a reduced number of input parameters will affect the
prediction accuracy. Like complexity metrics, this idea is borrowed from
statistical learning theory [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and, with regard to deep learning, it has seen more
success in terms of improving the computational cost [
        <xref ref-type="bibr" rid="ref16 ref5">5, 16</xref>
        ] and interpretability
of DNNs than in improving generalization.
      </p>
      <p>
        From a representational point of view, some have argued that the functions
that overparameterized DNNs approximate are inherently insensitive to input
perturbations at the optima to which they converge [
        <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21">20, 21, 18, 19</xref>
        ]. These
investigations place a large emphasis on design choices (depth, width,
activation functions, etc.) and are typically exploratory in nature.
      </p>
      <p>
        A new approach proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] investigates DNN generalization by
means of "margin distributions", a measure of how far samples tend to be from
the decision boundary. These types of metrics have been successfully used to
indicate generalization ability in linear models such as support vector machines
(SVMs); however, determining the decision boundary in a DNN is not as simple.
Therefore, they used the first-order Taylor approximation of the distance to the
decision boundary between the ground-truth class and the second-highest-ranking
class. They were able to use a linear model, trained on the margin distributions
of numerous DNNs, to predict the generalization error of several out-of-sample
DNNs. This suggests that DNN generalization is strongly linked to the type of
representation the network uses to represent sample information throughout its
layers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Hidden layers as general feature space converters</title>
      <p>Each hidden layer in an MLP is known to act as a learned representation,
converting the feature space from one representation to another. At the same time,
individual nodes act as local decision makers and respond to very specific subsets
of the input space, as discussed below.</p>
      <sec id="sec-3-1">
        <title>Per-layer feature representations</title>
        <p>In a typical MLP training scenario, several hidden layers are stacked in series.
The output layer is the only one which is directly used to perform the global task.
The role of the hidden layers is to enable the approximation of
the necessary non-linear function through the use of the non-linearities produced
by their activation functions. In this sense, all hidden layers can be thought of
as general feature space converters, where the behaviour of one dimension in
one feature space is determined by a weighted contribution of all the dimensions
in the preceding feature space. Refer to Fig. 1 for a visual illustration of this
viewpoint.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Per node decision making</title>
        <p>If every hidden layer produces a feature space, then every node determines a
single dimension in this feature space. Some insights can be gained by
theoretically describing the behaviour of a node in terms of activation patterns in the
preceding layer. Let a_l be an activation vector at a layer l as a response to an
input sample x. If W_l is the weight matrix connecting l and the previous layer
l−1, then:
a_l = f_a(W_l a_{l−1})
(1)
where f_a is some element-wise non-linear activation function. For every node i
in l, the activation value is then:</p>
        <p>a_{l,i} = f_a(w_{l,i} · a_{l−1})
(2)
where w_{l,i} is the row in matrix W_l connecting layer l−1 and node i. This can
be rewritten as:
a_{l,i} = f_a(‖w_{l,i}‖ ‖a_{l−1}‖ cos θ)
(3)</p>
        <p>
          with θ specifying the angle between the weight vector and the activation vector.
The pre-activation node value is determined by the product of the norm of the
activation vector (in the previous layer) and the norm of the relevant weight
vector (in the current layer), scaled by the cosine similarity of the two vectors.
As a result, if the activation function is a rectified linear unit (ReLU) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and a
bias is not used, the angle between the activation vector and the weight vector
has to be in (−90°, 90°) for the sample information to be propagated by the
node. In other words, the node is activated for samples that produce an
activation pattern in the preceding layer with a cosine similarity larger than 0 with
respect to the weight vector. This criterion holds regardless of the activation or
weight strengths. (When a bias is used, the threshold angles are different, but
the concept remains the same.)
        </p>
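        <p>The equivalence of Eqs. 2 and 3, and the resulting activation criterion, can be checked numerically (a small sketch with arbitrary random vectors):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.normal(size=512)   # activation vector a_{l-1}
w = rng.normal(size=512)        # weight row w_{l,i} for node i in layer l

pre = w @ a_prev                # pre-activation, as in Eq. (2)
cos_theta = pre / (np.linalg.norm(w) * np.linalg.norm(a_prev))
# Eq. (3): the same pre-activation as a norm product times cosine similarity
assert np.isclose(pre, np.linalg.norm(w) * np.linalg.norm(a_prev) * cos_theta)

relu = lambda z: np.maximum(z, 0.0)
# Without a bias, the ReLU node fires iff cos(theta) > 0, i.e. the angle
# between w and a_{l-1} lies in (-90°, 90°), regardless of the magnitudes.
assert (relu(pre) > 0) == (cos_theta > 0)
```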
      </sec>
      <sec id="sec-3-3">
        <title>Sample sets</title>
        <p>
          When training a ReLU-activated network, the corresponding weight vector is
only updated to optimize the global error in terms of those specific samples
for which a node is active, referred to as the "sample set" of that node. This is
because weights are updated according to the gradients, which can only propagate
back through the network to the weight vector if the corresponding node is active.
In this sense, the node weight vector acts as a hyperplane in the feature space of
the preceding layer. This hyperplane corresponds to the points where the inner
product with the weight vector is equal to zero (or the bias value if one is present).
Samples which are located on one side of the hyperplane are prevented from
having an effect on the weight vector values, and samples on the other side are
used to update the weights, thereby dictating the behaviour of one dimension in
the following feature space. The actual weight updates are affected by the
representations of sample information in all the layers following the current one. This
phenomenon has been investigated and discussed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where several types of per-layer classifiers
were constructed from the nodes of trained MLPs. These classifiers were shown
to perform at levels comparable to the global network from which they were
created.
        </p>
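        <p>The sample set of a node, as described above, can be extracted directly from the previous layer's activations (a sketch; the variable names are illustrative):</p>

```python
import numpy as np

def sample_set(W_l, A_prev, node_i):
    """Return the indices of samples for which node `node_i` in layer l is
    active (positive ReLU pre-activation), given the previous layer's
    activations A_prev with shape (n_samples, n_features). Only these
    samples propagate gradient back through this node's weight vector."""
    pre = A_prev @ W_l[node_i]      # pre-activations for every sample
    return np.flatnonzero(pre > 0)  # samples on the "active" side of the hyperplane

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(100, 32))   # toy previous-layer activations
W_l = rng.normal(size=(64, 32))       # toy weight matrix for layer l
active = sample_set(W_l, A_prev, node_i=0)
```

Each node thus partitions the sample population with a hyperplane in the preceding feature space, and its weights are updated only for the samples in `active`.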
        <p>To summarize this theoretical viewpoint of ReLU-activated MLP training:
1. Hidden layers represent sample information in a feature space unique to
every layer.
2. The representations of sample information are created on a per-dimension
basis by the weight vectors and activation functions linking the nodes to the
preceding layer.
3. A node is only active for samples with a directional similarity in the previous
feature space.
4. These sets of samples are used to update the weight vectors during training
and (by extension) the representation used by the following feature space.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Noise in the data</title>
      <sec id="sec-4-1">
        <title>Model performance</title>
        <p>
          In order to investigate the sample sets and activation/weight vector interactions,
several MLPs, containing 10 hidden layers of 512 nodes each, are trained on
varying levels of noise. A standard experimental setup is used, as described in
appendix A. Fig. 2 shows the resulting performance of the different models
when tested on uncorrupted test data. All models were able to easily fit the
noisy training data, corroborating the findings of [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>Notice that, when analyzing label corruption, there is a near-linear inverse
correlation between the amount of label noise and the model's ability to
generalize to unseen data. This suggests that either:
1. the models are memorizing sample-specific input-output relationships, and a
certain portion of the unseen data is similar enough to a corresponding
portion of uncorrupted training samples to facilitate appropriate generalization;
or
2. the global approximated function is somehow compartmentalised to contain
fundamental rules about the task in some regions and ad hoc rules with
which to correctly classify the corrupted samples in other regions.
(Fig. 2 panels: (a) label corruption; (b) Gaussian input corruption; (c) structured input corruption.)</p>
        <p>Observe from the results of the two input corruptions that noise in the input data
has an almost negligible effect on generalization up to the point where there is an
insufficient amount of true data in the set with which to learn. This threshold is
expected to change with more difficult classification tasks (more class variance
and overlap), data sets containing fewer samples in total, and models with less
parameter flexibility.</p>
        <p>The fact that input noise does not result in a linear reduction in generalization
ability still supports both of the previous propositions. If the first postulate is true,
then the samples with corrupted input data are memorized, but no samples in
the evaluation set are similar enough to them to incur additional generalization
error. If the second postulate is true, then the regions in the approximated
function that were determined by the corrupted input samples are simply never
used for classifying the uncorrupted evaluation set.</p>
        <p>It is also worth noting that the Gaussian and structured input corruptions
have very similar influences on generalization. The models are therefore able to
generalize in the presence of input noise regardless of its predictability.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Cosine similarities</title>
        <p>The cosine similarity (cos θ, as used in Eq. 3) can be used as an estimate
of how much the representation of a sample in a preceding layer is directionally
similar to that of the set of samples for which a node tends to be active. By
measuring the average cosine similarity of samples in a node's sample set with
respect to the determining weight vector (w_{l,i} in Eq. 3) and averaging over nodes
in a layer, it is possible to identify layers where samples tend to be grouped
together convincingly. That is, the samples are closely related (in the preceding
feature space) and the resulting activation values tend to be large.</p>
        <p>Using Eq. 3, the mean cosine similarity per layer l (over all weight and active
sample pairs at every node) can be calculated as:
cosine(l) = (1/|N_l|) Σ_{i∈N_l} (1/|A_i|) Σ_{a∈A_i} a_{l,i} / (‖w_{l,i}‖ ‖a_{l−1}‖)
(4)
where N_l is the set of all the nodes in layer l and A_i is the set of all positive
activation values at node i.</p>
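        <p>Eq. 4 can be computed directly from a layer's weight matrix and the previous layer's activations. Note that for an active ReLU node the summand a_{l,i}/(‖w_{l,i}‖ ‖a_{l−1}‖) is exactly cos θ; the following sketch uses that fact and skips dead nodes, as done for the averages in this paper:</p>

```python
import numpy as np

def mean_cosine(W_l, A_prev):
    """Eq. (4): mean cosine similarity for layer l, averaging over nodes
    and, per node, over the samples in its sample set (positive activations).
    W_l: (n_nodes, n_prev) weights; A_prev: (n_samples, n_prev) activations."""
    pre = A_prev @ W_l.T                                   # (n_samples, n_nodes)
    cos = pre / (np.linalg.norm(A_prev, axis=1, keepdims=True)
                 * np.linalg.norm(W_l, axis=1))            # cosine for every pair
    per_node = []
    for i in range(W_l.shape[0]):
        active = cos[:, i] > 0          # node i's sample set (ReLU criterion)
        if active.any():                # dead nodes are omitted
            per_node.append(cos[active, i].mean())
    return float(np.mean(per_node))
```

Since only positive cosines enter the average, the metric lies in (0, 1], and large values indicate convincingly grouped sample sets.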
        <p>Fig. 3 shows this metric for models trained on various amounts of noise. It
can be observed that noise in either the input or labeling data results in high
mean cosine similarities being obtained at greater depths than in the baseline
models. Additionally, take note that the cosine similarities for the structured
input noise are more spread out across layers than for the other two types of
noise, which is more consistent with the baseline model. Lastly, it is clear from
the results of the models containing noise in the labeling data that a "convincing"
representation of training data is obtained at around layer 6. Very small cosine
similarities are observed in the earlier layers of all models under any noise
conditions.</p>
        <p>The noise in the data sets is introduced probabilistically on a per-sample basis
(see Alg. 1 to 3). This provides a convenient way to investigate the composition
of sample sets with regard to noise. Fig. 4 shows how sample sets can consist
of different ratios of true and corrupted sample information.</p>
        <p>We define the polarization of a node for a class as the amount by which the
sample set of a node i favours either the corrupted or uncorrupted samples of a
class c. This is defined as follows:
polarization(c, i) = 2 | |A^f_{i,c}| / |A_{i,c}| − 1/2 |
(5)
where A_{i,c} is the set of all positive activation values at node i in response to
samples from class c, and A^f_{i,c} is the corresponding set limited to corrupted
samples. By averaging over nodes and classes, a per-layer mean polarization value
can be obtained with the following equation:
polarization(l) = (1/(|K| |N_l|)) Σ_{c∈K, i∈N_l} polarization(c, i)
(6)
where K is the set of all classes. Dead nodes (that are not activated for any sample)
are omitted when performing the averaging operations. A dead node does not
contribute to the global function and merely reduces the dimensionality of the
feature space in the corresponding layer by 1.</p>
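        <p>A direct implementation of Eqs. 5 and 6 for one layer can be sketched as follows (the right-hand side of Eq. 5 is partly reconstructed here: the form 2·| |A^f_{i,c}|/|A_{i,c}| − 1/2 | is assumed, which yields 1 for fully polarized sample sets and 0 for evenly mixed ones):</p>

```python
import numpy as np

def polarization(active_mask, labels, corrupted_mask):
    """Eqs. (5)-(6) for one layer.
    active_mask: (n_samples, n_nodes) boolean, sample in node's sample set.
    labels: (n_samples,) class ids. corrupted_mask: (n_samples,) boolean.
    Averages 2 * |frac_corrupted - 1/2| over classes and non-dead nodes."""
    vals = []
    for i in range(active_mask.shape[1]):
        for c in np.unique(labels):
            in_set = active_mask[:, i] & (labels == c)
            n = in_set.sum()
            if n == 0:
                continue  # dead node/class pair: omitted from the average
            frac_corrupt = (in_set & corrupted_mask).sum() / n
            vals.append(2.0 * abs(frac_corrupt - 0.5))
    return float(np.mean(vals))
```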
        <p>The polarization metric indicates how much the sample sets formed within a
layer are polarized between true class information and corrupted class
information, for any given class. The relevant polarization values are provided in Fig. 5.
The main observation is that sample sets tend to be highly in favour of either
true or corrupted sample information, and this is especially true in the later
layers of any given model. The label and structured input corruptions produce
lower polarization earlier in the model, but this is to be expected, seeing as the
input data has strong coherence based on correlations in class-related input
structures. These findings support the second postulate in section 4.1. It appears
that sub-regions in the function space are dedicated to processing different kinds
of training data.</p>
        <p>In this section we have shown that several overparameterized DNNs with no
explicit regularization are able to generalize well with evident spurious
input-output relationships present in the training data. We have used empirically
evaluated metrics to show that, in the presence of noise, well-separated per-node
sample sets are generated later in the network compared to baseline cases with
no noise. Additionally, these well-separated sets of samples are highly polarized
between samples containing true task information and samples without.</p>
        <p>If we accept the viewpoint that nodes in hidden layers act as collaborating
feature differentiators (separating samples based on feature criteria that are
unique to each node) to generate informative feature spaces, then each layer
also acts as a mixture model fitting samples based on their representation in
the preceding layer. A key insight is that each model component is not fitted
to all samples in the data set. Model components (referring to a node and its
corresponding weight vector) are optimized on a specific subset of the population,
as determined by the activation patterns in the preceding layer. And, as we
have observed, these subpopulations tend to be composed of either true task
information or false task information.</p>
        <p>In this sense, some model components of the network are dedicated to
correctly classifying uncorrupted samples, and others are dedicated to corrupted
samples. To generalize this observation to training scenarios without explicit
data corruption, it can be observed that in most data sets samples from a
specific class have varied representations. Without defining some representations as
noise, they are still processed in the same way the structured input corruption
data is processed in this paper; hence the strong coherence between the
baseline models and those containing structured input noise. This is also why it is
possible to perform some types of multitask learning. One example would be
combining MNIST and FMNIST. In this scenario the training set will contain
120 000 examples with consistent training labels, but two distinct
representations in the input space. For example, class 6 will be correctly assigned to both
the written number 6 and a shirt.</p>
        <p>To summarise, DNNs do overfit on noise, albeit benignly. The vast
representational capacity and non-linearities enable subcomponents of the network to be
dedicated to processing subpopulations of the training data. When out-of-sample
data is to be processed, the regions most similar to the unseen data are used to
make predictions, thereby preventing the model components fitted to noise from
affecting generalization.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have investigated the phenomenon that DNNs are able to generalize in spite
of perfectly fitting noisy training data. An interesting mechanism that is intrinsic
to non-linear function approximation in deep MLPs has been described, and
supporting empirical analyses have been provided. The findings in this paper
suggest that good generalization in large DNNs, in spite of extreme noise in
the training data, is a result of the modular way training samples are fitted
during optimization. Future work will attempt to construct a formal framework
with which to characterize the collaborating sub-components and, based on this,
possibly produce some theoretically grounded predictors of generalization.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the National Research Foundation (NRF,
Grant Number 109243). Any opinions, findings, conclusions or recommendations
expressed in this material are those of the authors and the NRF does not accept
any liability in this regard.</p>
    </sec>
    <sec id="sec-7">
      <title>Experimental setup</title>
      <p>
        The classification data sets that are used for empirical investigations are MNIST
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and FMNIST[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. These data sets are drop-in replacements for each other,
meaning that they contain the same number of input dimensions, classes, and
examples: namely 28x28 (784 flattened), 10, and 70 000, respectively. In all
training scenarios a random split of 55 000 training examples, 5 000 validation
examples, and 10 000 evaluation examples is used. The generalization gap is
calculated by subtracting the evaluation accuracy from the training accuracy.
      </p>
      <p>
        All models are trained for 400 epochs with randomly selected batches of 128
examples. This amounts to 171 600 parameter updates in total. In order to ensure
that the training data is completely fitted, this is repeated for at least 2 random
uniform parameter initializations, and the model that best fits the training data is
selected to be analysed. A fixed MLP architecture containing 10 hidden layers of
512 nodes each is used, with a single bias node at the first layer. ReLU activation
functions are used at all hidden layers. The popular Adam [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] optimizer is used
with a simple mean squared error (MSE) loss function at the output. No output
activation function is employed and no additional regularizing techniques are
implemented; this excludes batch normalization, weight decay, dropout, data
augmentation, and early stopping.
      </p>
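      <p>For reference, the representational capacity implied by this fixed architecture can be tallied with a quick count (a sketch; counting the single first-layer bias node as one weight per first-hidden-layer node is our interpretation of the description above):</p>

```python
# Parameter count of the fixed MLP described above:
# 784 inputs -> 10 hidden layers of 512 ReLU nodes -> 10 outputs.
layer_sizes = [784] + [512] * 10 + [10]

# Weight matrices between consecutive layers
weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# A single bias node at the first layer: one extra weight per node
# in the first hidden layer (our interpretation).
bias = 512

total = weights + bias
# 784*512 + 9*512*512 + 512*10 gives roughly 2.77 million parameters,
# far exceeding the 55 000 training samples: heavily overparameterized.
```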
    </sec>
    <sec id="sec-8">
      <title>Algorithms for adding noise</title>
      <p>
        The first type of noise is identical to the "Partially corrupted labels" in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
except for the selection of alternative labels. Instead of selecting a random label
uniformly, we select a random alternative (not including the true label)
uniformly. This results in a data set whose corruption level is closer to the one
represented by the probability value P that determines corruption levels. See
Algorithm 1. The second type of noise is similar to the "Gaussian" in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], with
the difference being that instead of replacing input data with Gaussian noise
drawn from a variable with identical mean and variance to the data set, we
determine the mean and variance of the Gaussian in terms of the specific sample
being corrupted. See Algorithm 2. The third type of noise replaces input data
with alternatives that are completely different from any in the true data set but
still structured in a way that is predictable. This is accomplished by rotating
the sample 90° counter-clockwise about the center, followed by an inversion of
the feature values. Inversion refers to subtracting the feature value from the
maximum feature value in the sample. See Algorithm 3.
      </p>
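      <p>The effect of drawing from the |C|−1 alternative labels, rather than from all |C| labels, on the realized corruption level can be verified numerically (a sketch; with uniform draws over all labels, a fraction 1/|C| of the "corrupted" samples silently keep their true label):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes, P = 100_000, 10, 0.5
y = rng.integers(0, n_classes, size=n)
hit = rng.random(n) < P                  # B(1, P) per-sample selection

# Uniform over ALL labels (as in Zhang et al.): a "corrupted" sample
# keeps its true label with probability 1/|C|.
naive = y.copy()
naive[hit] = rng.integers(0, n_classes, size=hit.sum())

# Scheme used here: uniform over the |C|-1 ALTERNATIVE labels.
alt = y.copy()
draws = rng.integers(0, n_classes - 1, size=hit.sum())
alt[hit] = np.where(draws >= y[hit], draws + 1, draws)  # skip the true label

frac_naive = (naive != y).mean()   # approximately P * (1 - 1/|C|)
frac_alt = (alt != y).mean()       # approximately P
```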
      <sec id="sec-8-1">
        <title>Algorithm 1: Label corruption</title>
        <p>Input: A train set of labelled examples (X(train), Y(train)), a set of possible
labels C, and a probability value P</p>
        <p>Output: A train set of corrupted examples (X(train), Ŷ(train))
1 for y in Y(train) do
2   if B(1, P) then
3     ŷ ← U[C \ {y}]</p>
      </sec>
      <sec id="sec-8-2">
        <title>Algorithm 2: Gaussian input corruption</title>
        <p>Input: A train set of labelled examples (X(train), Y(train)), a set of possible
labels C, and a probability value P</p>
        <p>Output: A train set of corrupted examples (X̂(train), Y(train))
1 for x in X(train) do
2   if B(1, P) then
3     x̂ ← g
4 where g is a vector sampled from N(μ_x, σ_x)</p>
      </sec>
      <sec id="sec-8-3">
        <title>Algorithm 3: Structured input corruption</title>
        <p>Input: A train set of labelled examples (X(train), Y(train)), a set of possible
labels C, and a probability value P</p>
<p>Output: A train set of corrupted examples (X̂(train), Y(train))
1 for x in X(train) do
2   if B(1, P) then
3     x̂ ← invert(rotate(x))
  where rotate is a 90° rotation counter-clockwise about the origin
  and invert is an inversion of all values in the vector</p>
      </sec>
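For concreteness, the three corruption procedures can be sketched in Python roughly as follows. This is a minimal NumPy sketch under our own naming: the array shapes, the flattened-square-image assumption in the structured case, and the RNG handling are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y, classes, p):
    """Algorithm 1: with probability p, replace each label with an
    alternative chosen uniformly (never the true label)."""
    y = y.copy()
    for i in range(len(y)):
        if rng.random() < p:  # Bernoulli(p) draw
            alternatives = [c for c in classes if c != y[i]]
            y[i] = rng.choice(alternatives)
    return y

def corrupt_gaussian(X, p):
    """Algorithm 2: with probability p, replace a sample with Gaussian
    noise whose mean and spread are taken from that specific sample."""
    X = X.astype(float).copy()
    for i in range(len(X)):
        if rng.random() < p:
            X[i] = rng.normal(X[i].mean(), X[i].std(), size=X[i].shape)
    return X

def corrupt_structured(X, p, side):
    """Algorithm 3: with probability p, rotate the (side x side) sample
    90 degrees counter-clockwise and invert its feature values."""
    X = X.copy()
    for i in range(len(X)):
        if rng.random() < p:
            img = X[i].reshape(side, side)
            img = np.rot90(img)        # 90° counter-clockwise rotation
            img = img.max() - img      # invert: max feature value minus each value
            X[i] = img.reshape(-1)
    return X
```

Note that `rng.random() < p` implements the Bernoulli draw B(1, P) in the algorithm listings, so setting P = 1 corrupts every sample and P = 0 corrupts none.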
    </sec>
    <sec id="sec-9">
      <title>Additional results</title>
<p>The mean cosine similarities and mean polarization values for the analyses conducted
on the FMNIST data set are provided in Figs. 6 and 7, respectively. Notice that
most of the observations made for the MNIST results in Sections 4.2 and 4.3
hold here as well. It is, however, worth noting that for a classification task
with more overlap in the input space, such as FMNIST, the well-separated sample
sets are generated at even later layers and the polarization is higher overall.
[Figs. 6 and 7, panels: (a) label corruption, (b) Gaussian input corruption, (c) structured input corruption]</p>
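As a rough illustration of the kind of per-layer measurement summarized in these figures, the mean cosine similarity between same-class hidden representations at a single layer could be computed as below. This is a hedged sketch: the activation matrix, the class grouping, and the averaging scheme are our assumptions, not the authors' exact procedure.

```python
import numpy as np

def mean_cosine_similarity(acts, labels):
    """Mean pairwise cosine similarity between the activations of samples
    sharing a class label, averaged over classes.
    acts: (n_samples, n_units) activation matrix at one layer.
    Assumes every class has at least two samples and no zero vectors."""
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sims = []
    for c in np.unique(labels):
        v = normed[labels == c]
        s = v @ v.T                 # pairwise dot products of unit vectors
        n = len(v)
        # average over off-diagonal entries only (exclude self-similarity)
        sims.append((s.sum() - n) / (n * (n - 1)))
    return float(np.mean(sims))
```

Tracking this quantity layer by layer shows where well-separated same-class sample sets emerge in the network.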
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chaudhari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choromanska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soatto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
,
<string-name>
  <surname>LeCun</surname>
  ,
  <given-names>Y.</given-names>
</string-name>
,
<string-name>
  <surname>Baldassi</surname>
  ,
  <given-names>C.</given-names>
</string-name>
,
<string-name>
  <surname>Borgs</surname>
  ,
  <given-names>C.</given-names>
</string-name>
,
<string-name>
  <surname>Chayes</surname>
  ,
  <given-names>J.T.</given-names>
</string-name>
,
<string-name>
  <surname>Sagun</surname>
  ,
  <given-names>L.</given-names>
</string-name>
,
<string-name>
  <surname>Zecchina</surname>
  ,
  <given-names>R.</given-names>
</string-name>
: Entropy-SGD:
          <article-title>Biasing gradient descent into wide valleys</article-title>
          .
<source>ArXiv abs/1611.01838</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Davel</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theunissen</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretorius</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnard</surname>
          </string-name>
          , E.:
<article-title>DNNs as layers of cooperating classifiers</article-title>
          .
          <source>In: AAAI</source>
          <year>2020</year>
          (
          <article-title>submitted for publication)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dinh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Sharp minima can generalize for deep nets</article-title>
          .
          <source>arXiv preprint arXiv:1703.04933v2</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Elsayed</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobahi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Regan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Large margin deep networks for classification</article-title>
          .
          <source>In: NeurIPS</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elsen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hooker</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The state of sparsity in deep neural networks</article-title>
          .
<source>ArXiv abs/1902.09574</source>
(
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Glorot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
<article-title>Deep sparse rectifier neural networks</article-title>
          .
          <source>In: AISTATS</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Flat minima</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          ,
          <issue>1</issue>
–
          <fpage>42</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jastrzebski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kenton</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storkey</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>On the relation between the sharpest directions of dnn loss and the sgd step length</article-title>
          .
          <source>In: ICLR</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobahi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Predicting the generalization gap in deep networks with margin distributions</article-title>
.
<source>arXiv preprint arXiv:1810.00113v2 (In ICLR 2019)</source>
(
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kawaguchi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Generalization in deep learning</article-title>
          .
<source>ArXiv abs/1710.05468</source>
(
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Keskar</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mudigere</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nocedal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smelyanskiy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>P.T.P.</given-names>
          </string-name>
          :
<article-title>On large-batch training for deep learning: Generalization gap and sharp minima</article-title>
          .
<source>ArXiv abs/1609.04836</source>
(
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
<source>arXiv preprint arXiv:1412.6980 (In ICLR 2015)</source>
(
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
<string-name>
  <surname>Haffner</surname>
  ,
  <given-names>P.</given-names>
</string-name>
:
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <issue>11</issue>
          ),
          <volume>2278</volume>
–
          <fpage>2324</fpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , G.,
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Visualizing the loss landscape of neural nets</article-title>
.
<source>In: NeurIPS</source>
(
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Loroch</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfreundt</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wehn</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keuper</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Sparsity in deep neural networks - an empirical investigation with tensorquant</article-title>
          .
          <source>In: DMLE/IOTSTREAMING@PKDD/ECML</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Structured sparsity and generalization</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>13</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Montufar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On the number of linear regions of deep neural networks</article-title>
          .
<source>ArXiv abs/1402.1869</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Neal</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mittal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baratin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tantia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scicluna</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lacoste-Julien</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitliagkas</surname>
            ,
            <given-names>I.:</given-names>
          </string-name>
<article-title>A modern take on the bias-variance tradeoff in neural networks</article-title>
          .
<source>ArXiv abs/1810.08591</source>
(
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Novak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahri</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
<string-name>
  <surname>Abolafia</surname>
  ,
  <given-names>D.A.</given-names>
</string-name>
,
<string-name>
  <surname>Pennington</surname>
  ,
  <given-names>J.</given-names>
</string-name>
,
<string-name>
  <surname>Sohl-Dickstein</surname>
  ,
  <given-names>J.</given-names>
</string-name>
          :
          <article-title>Sensitivity and generalization in neural networks: an empirical study</article-title>
          .
          <source>In: International Conference on Learning Representations (ICLR)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Raghu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poole</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganguli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          , J.:
          <article-title>On the expressive power of deep neural networks</article-title>
          .
          <source>In: Proceedings of the 34th International Conference on Machine Learning</source>
          . pp.
<fpage>2847</fpage>
–
<lpage>2854</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.N.:</given-names>
          </string-name>
          <article-title>An overview of statistical learning theory</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rasul</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
          </string-name>
          , R.:
          <article-title>Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms</article-title>
          .
          <source>arXiv preprint arXiv:1708.07747v2</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Recht</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Understanding deep learning requires rethinking generalization</article-title>
          .
<source>arXiv preprint arXiv:1611.03530 (In ICLR 2017)</source>
(
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>