    Insights regarding overfitting on noise in deep
                       learning

          Marthinus W. Theunissen[0000−0002−7456−7769] , Marelie H.
      Davel[0000−0003−3103−5858] , and Etienne Barnard[0000−0003−2202−2369]
     {tiantheunissen, marelie.davel, etienne.barnard}@gmail.com

       Multilingual Speech Technologies, North-West University, South Africa;
                              and CAIR, South Africa



       Abstract. The understanding of generalization in machine learning is
       in a state of flux. This is partly due to the relatively recent revelation
       that deep learning models are able to completely memorize training data
       and still perform appropriately on out-of-sample data, thereby contra-
       dicting long-held intuitions about generalization. The phenomenon was
       brought to light and discussed in a seminal paper by Zhang et al. [24].
       We expand upon this work by discussing local attributes of neural net-
       work training within the context of a relatively simple and generalizable
       framework. We describe how various types of noise can be compensated
       for within the proposed framework in order to allow the global deep
       learning model to generalize in spite of interpolating spurious function
       descriptors. Empirically, we support our postulates with experiments in-
       volving overparameterized multilayer perceptrons and controlled noise in
       the training data.
       The main insights are that deep learning models are optimized for train-
       ing data modularly, with different regions in the function space dedicated
       to fitting distinct kinds of sample information. Detrimental overfitting is
       largely prevented by the fact that different regions in the function space
       are used for prediction based on the similarity between new input data
       and the data each region was optimized for.


Keywords: Deep learning · Machine learning · Learning theory · Generalization


1    Introduction
The advantages of deep learning models over some of their antecedents include their efficient optimization, scalability to high-dimensional data, and performance on data they were not optimized on [7]. The latter is arguably the most important benefit. Machine learning, as a whole, has seen much progress in recent years, and deep neural networks (DNNs) have become a cornerstone in numerous important domains such as computer vision, natural language processing and bioinformatics. Somewhat ironically, the surge of application potential that deep learning has unlocked in industry has left the development of theoretically principled guidelines lagging behind implementation-specific progress.



Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)

A particular example of one such open theoretical question is how to reconcile the observed ability of DNNs to generalize with classical notions of generalization in machine learning.
     Before deep learning, a generally accepted principle with which to reason
about generalization in machine learning was that it is linked to the complexity
of the hypothesis space, and that the model’s representational capacity should
be kept small so as to prevent it from approximating unrealistically complex
functions [22]. Such highly complex functions are not expected to be applicable
to the task being performed, and are usually a result of overfitting the spurious
correlations found in the finite sample of training examples. Many complexity metrics have been proposed and adapted in an attempt to formalize this intuition; however, these metrics have consistently failed to account robustly for the generalization observed in deep learning models.
     An influential paper by Zhang et al. [24] has demonstrated that: (1) DNNs
can efficiently fit various types of noise and still generalize well, and (2) con-
temporary explicit regularization is not required to enable good generalization.
These findings are in stark contradiction with complexity-based principles of gen-
eralization: deep learning models are shown to have representational capacities
large enough to approximate extremely complex functions (potentially memo-
rizing the entire data set) and still have very low out-of-sample test error.
     In this paper we further investigate the effect of noise in training data with
regards to generalization. Where [24] contributed towards the understanding of
generalization by pointing out misconceptions with regards to model capacity
and regularization, our contribution is to provide insight into a particular phe-
nomenon that can (at least partially) account for the observed ability of a DNN
to generalize in a very similar experimental framework. We define noise as any input-output relationship that is not predictable, or that is not conducive to the model fitting the true training data or approximating the true data distribution. The
following types of noise are investigated (detailed definitions are provided in
appendix B):
 1. Label corruption: For every affected sample, its training label is replaced
    with an alternative selected uniformly from all other possibilities.
 2. Gaussian input corruption: For every affected sample, all its input fea-
    tures are replaced with Gaussian noise with unchanged mean and standard
    deviation.
 3. Structured input corruption: For every affected sample, its input fea-
    tures are replaced by an alternative sample that is completely distinguishable
    from the true input but consistent per class. (For example, replacing selected
    images from one class with images of a completely different object from a
    different data set, but keeping the original class label.)
    These types of noise have been chosen to encompass those investigated in [24],
place more emphasis on varying the noise, and still maintain the analytically con-
venient per-sample implementation of the framework. It is possible to introduce
other types of noise, such as stochastic corruptions in the feature representations
at each layer within the architecture or directly impeding the optimization of the

model. However, such noise is closely analogous to existing regularizing techniques (dropout, weight decay, node pruning, etc.) and is not aligned with the goals of this paper, namely, to shed light on how a DNN with as few regularizing factors as possible manages noisy data.
     The empirical investigation is limited to extremely overparameterized multilayer perceptron (MLP) architectures with ReLU activations, and two related classification data sets. These architectures are simple enough to allow for efficient analysis but function according to the same principles as more complex architectures. The two classification tasks are MNIST [14] and FMNIST [23]. These data sets are on the low end of the spectrum with regard to difficulty, with FMNIST slightly more difficult than MNIST; both are widely used to investigate the theoretical properties of DNNs [9, 11, 20].
     The following section (Section 2) discusses some recent, notable work aimed at characterizing generalization in deep learning. In Section 3 a theoretical discussion is provided that defines a view of DNN training that is useful for interpreting the results that follow. A detailed analysis of the ability of an MLP to respond to noisy data is provided and discussed in Section 4. Findings are summarized in the final section with a focus on implications relating to generalization.


2   Related work

Many attempts at conceptualizing the generalization ability of DNNs focus on
the stability with which a model accurately predicts output values in the pres-
ence of varying input values. One popular approach is to analyse the geometry
of the loss landscape at the optimum [8, 1, 12]. This amounts to investigating
the overall loss value within a region of parameter space close to the optimal pa-
rameterization obtained by the training process. The intuition is that a sharper
minimum or high curvature at the solution point indicates that the model will
be sensitive to small perturbations in the input space. Logically, this will lead
to poor generalization. A practical problem with this approach is that it suffers
heavily from the curse of dimensionality: it is difficult to obtain an unbiased and consistent perspective on the error surface in the high-dimensional parameter spaces of virtually all practical DNNs. The error
surface is typically mapped with dimensionality reductions, random searches,
or heuristic searches [15]. A conceptual problem is that the loss value is easily
manipulable by weight scaling. For example, Dinh et al. [3] have shown that
a minimum can be made arbitrarily sharp/flat with no effect on generalization
by exploiting simple symmetries in the weight scales of networks with rectified
units.
    Another effort at imposing stability in model predictions is to enforce sparsity in the parameterization. The hope is that, with a sparsely connected set of trainable parameters, fewer input dimensions will affect any given prediction. Like complexity metrics, this idea is borrowed from statistical learning theory [17]; with regard to deep learning, it has seen more

success in terms of improving the computational cost [5, 16] and interpretability
of DNNs than in improving generalization.
    From a representational point of view, some have argued that the functions that overparameterized DNNs approximate are inherently insensitive to input perturbations at the optima to which they converge [20, 21, 18, 19]. These investigations place a large emphasis on design choices (depth, width, activation functions, etc.) and are typically exploratory in nature.
    A new approach proposed by [4] and [10] investigates DNN generalization by
means of “margin distributions”, a measure of how far samples tend to be from
the decision boundary. These types of metrics have been successfully used to
indicate generalization ability in linear models such as support vector machines
(SVMs); however, determining the decision boundary in a DNN is not as simple.
Therefore, they used the first-order Taylor approximation of the distance to the
decision boundary between the ground truth class and the second highest ranking
class. They were able to use a linear model, trained on the margin distributions
of numerous DNNs, to predict the generalization error of several out-of-sample
DNNs. This suggests that DNN generalization is strongly linked to how the network represents sample information throughout its layers.


3     Hidden layers as general feature space converters
Each hidden layer in an MLP is known to act as a learned representation, con-
verting the feature space from one representation to another. At the same time,
individual nodes act as local decision makers and respond to very specific subsets
of the input space, as discussed below.

3.1   Per-layer feature representations
In a typical MLP training scenario, several hidden layers are stacked in series.
The output layer is the only one that is directly used to perform the global task. The role of the hidden layers is to enable the approximation of the necessary non-linear function through the non-linearities produced by their activation functions. In this sense, all hidden layers can be thought of as general feature space converters, where the behaviour of one dimension in one feature space is determined by a weighted contribution of all the dimensions in the preceding feature space. Refer to Fig. 1 for a visual illustration of this viewpoint.
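As a rough illustration of this viewpoint (not the authors' code), the following NumPy/scikit-learn sketch extracts the per-layer feature spaces of a bias-optional ReLU MLP by forward propagation and projects each to three dimensions with PCA, in the spirit of the depiction in Fig. 1; all function and variable names here are our own.

```python
import numpy as np
from sklearn.decomposition import PCA

def layer_feature_spaces(X, weights, biases=None):
    """Forward-propagate inputs through a ReLU MLP and collect the feature
    space (activation matrix) produced by every hidden layer.
    `weights` is a list of weight matrices, one per layer."""
    spaces = []
    a = X
    for l, W in enumerate(weights):
        z = a @ W
        if biases is not None and biases[l] is not None:
            z = z + biases[l]
        a = np.maximum(z, 0.0)      # ReLU: each node is one dimension of the space
        spaces.append(a)
    return spaces

def reduce_to_3d(spaces):
    """Project each feature space to 3 principal components for plotting."""
    return [PCA(n_components=3).fit_transform(a) for a in spaces]
```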

3.2   Per-node decision making
If every hidden layer produces a feature space, then every node determines a
single dimension in this feature space. Some insights can be gained by theoret-
ically describing the behaviour of a node in terms of activation patterns in the
preceding layer. Let a_l be the activation vector at layer l in response to an




Fig. 1: An illustration of feature spaces [from left to right: input; 4 x hidden layers; output layer] in a trained MLP. The model was trained to perform a 5-class classification task on 100 randomly generated 50-dimensional input vectors. Note that principal component analysis is used to reduce the dimensionality of the actual feature spaces to 3 for this visual depiction.


input sample x. If W_l is the weight matrix connecting l and the previous layer l − 1, then:

\[ a_l = f_a(W_l \cdot a_{l-1}) \tag{1} \]

where f_a is some element-wise non-linear activation function. For every node i in l the activation value is then:

\[ a_l^i = f_a\left(w_l^i \cdot a_{l-1}\right) \tag{2} \]

where w_l^i is the row in matrix W_l connecting layer l − 1 and node i. This can be rewritten as:

\[ a_l^i = f_a\left(\lVert w_l^i \rVert \, \lVert a_{l-1} \rVert \cos\theta\right) \tag{3} \]

with θ specifying the angle between the weight vector and the activation vector. The pre-activation node value is determined by the product of the norm of the activation vector (in the previous layer) and the norm of the relevant weight vector (in the current layer), scaled by the cosine similarity of the two vectors.
As a result, if the activation function is a rectified linear unit (ReLU) [6] and a
bias is not used, the angle between the activation vector and the weight vector
has to be ∈ (−90◦ , 90◦ ) for the sample information to be propagated by the
node. In other words, the node is activated for samples that produce an acti-
vation pattern in the preceding layer with a cosine similarity larger than 0 in
terms of the weight vector. This criterion holds regardless of the activation or
weight strengths. (When a bias is used, the threshold angles are different, but
the concept remains the same.)
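As a numerical check of this criterion, the short NumPy sketch below computes a single node's output both directly (Eq. 2) and via the norm-and-cosine factorization (Eq. 3), and asserts that a bias-free ReLU node is active exactly when cos θ > 0; the function name is illustrative, not from the paper.

```python
import numpy as np

def node_activation(w_i, a_prev):
    """Activation of node i (Eq. 2) and its decomposition as in Eq. 3.
    Assumes non-zero weight and activation vectors and no bias."""
    pre = w_i @ a_prev                                                    # Eq. 2 pre-activation
    cos_theta = pre / (np.linalg.norm(w_i) * np.linalg.norm(a_prev))
    factored = np.linalg.norm(w_i) * np.linalg.norm(a_prev) * cos_theta   # Eq. 3
    relu_out = max(factored, 0.0)
    # Without a bias, the node propagates the sample exactly when cos(theta) > 0,
    # i.e. the angle between w_l^i and a_{l-1} lies in (-90, 90) degrees.
    assert (relu_out > 0.0) == (cos_theta > 0.0)
    return relu_out, cos_theta
```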

3.3   Sample sets
When training a ReLU-activated network, a node's weight vector is only updated to optimize the global error in terms of those specific samples for which the node is active, referred to as the “sample set” of that node. This is because weights are updated according to gradients, which can only propagate

back through the network to the weight vector if the corresponding node is active.
In this sense, the node weight vector acts as a hyperplane in the feature space of the preceding layer. This hyperplane corresponds to the points whose dot product with the weight vector is equal to zero (offset by the bias value, if one is present). Samples which are
located on one side of the hyperplane are prevented from having an effect on
the weight vector values, and samples on the other side are used to update the
weights, thereby dictating the behaviour of one dimension in the following feature
space. The actual weight updates are affected by the representations of sample
information in all the layers following the current one. This phenomenon has
been investigated and discussed in [2], where several types of per-layer classifiers
were constructed from the nodes of trained MLPs. These classifiers were shown
to perform at levels comparable to the global network from which they were
created.
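In code, a node's sample set is simply the set of training samples for which its output is positive. A minimal NumPy sketch (array names are assumptions, not taken from the paper):

```python
import numpy as np

def sample_sets(A_prev, W_l, b_l=None):
    """Return a boolean mask of shape (num_samples, num_nodes): entry (n, i)
    is True if sample n is in the sample set of node i in layer l, i.e. the
    node is active and gradients can flow back to its weight vector w_l^i."""
    Z = A_prev @ W_l
    if b_l is not None:
        Z = Z + b_l
    return Z > 0.0   # the hyperplane w_l^i . a_{l-1} (+ bias) = 0 splits the samples
```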
    To summarize this theoretical viewpoint of ReLU-activated MLP training:

 1. Hidden layers represent sample information in a feature space unique to
    every layer.
 2. The representations of sample information are created on a per-dimension
    basis by the weight vectors and activation functions linking the nodes to the
    preceding layer.
 3. A node is only active for samples whose representation in the preceding
    feature space is directionally similar to the node's weight vector.
 4. These sets of samples are used to update the weight vectors during training
    and (by extension) the representation used by the following feature space.


4     Noise in the data

4.1   Model performance

In order to investigate the sample sets and activation/weight vector interactions,
several MLPs, containing 10 hidden layers of 512 nodes each, are trained on
varying levels of noise. A standard experimental setup is used, as described in
appendix A. Fig. 2 shows the resulting performance of the different models,
when tested on uncorrupted test data. All models were able to easily fit the
noisy training data, corroborating the findings of [24].
    Notice that, when analyzing label corruption, there is a near-linear inverse correlation between the amount of label noise and the model's ability to generalize to unseen data. This suggests that either:

 1. the models are memorizing sample-specific input-output relationships and a
    certain portion of the unseen data is similar enough to a corresponding por-
    tion of uncorrupted training samples to facilitate appropriate generalization;
    or
 2. the global approximated function is somehow compartmentalised to contain
    fundamental rules about the task in some regions and ad hoc rules with
    which to correctly classify the corrupted samples in other regions.




      (a) label corruption    (b) Gaussian input corruption    (c) structured input corruption

Fig. 2: The generalization gap for models trained on MNIST (blue) and FMNIST
(green) at varying levels of three types of noise. The horizontal axis represents the
probability of any given training sample having been corrupted for the relevant
model. All models are overparameterized and have perfect, or close to perfect,
performance on the training data.


Observe from the results of the two input corruptions that noise in the input data
has an almost negligible effect on generalization up to the point where there is an
insufficient amount of true data in the set with which to learn. This threshold is
expected to change with more difficult classification tasks (more class variance
and overlap), data sets containing fewer samples in total, and models with less
parameter flexibility.
    The fact that input noise does not result in a linear reduction in generalization
ability still supports both the previous propositions. If the first postulate is true,
then the samples with corrupted input data are memorized, but no samples in
the evaluation set are similar enough to them to incur additional generalization
error. If the second postulate is true, then the regions in the approximated
function that were determined by the corrupted input samples are simply never
used for classifying the uncorrupted evaluation set.
    It is also worth noting that the Gaussian and structured input corruptions
have very similar influences on generalization. The models are therefore able to
generalize in the presence of input noise regardless of its predictability.


4.2     Cosine similarities

The cosine similarity (cos θ as also used in Eq. 3) can be used as an estimate
of how much the representation of a sample in a preceding layer is directionally
similar to that of the set of samples for which a node tends to be active. By
measuring the average cosine similarity of samples in a node’s sample set with regard to the corresponding weight vector (w_l^i in Eq. 3) and averaging over nodes in a layer, it is possible to identify layers where samples tend to be grouped
together convincingly. That is, the samples are closely related (in the preceding
feature space) and the resulting activation values tend to be large.

   Using Eq. 3, the mean cosine similarity per layer l (over all weight and active sample pairs at every node) can be calculated as:

\[ \mu_{\mathrm{cosine}}(l) = \frac{1}{|N_l|} \sum_{i \in N_l} \left( \frac{1}{|A_i|} \sum_{a \in A_i} \frac{a_l^i}{\lVert w_l^i \rVert \, \lVert a_{l-1} \rVert} \right) \tag{4} \]

where N_l is the set of all the nodes in layer l and A_i is the set of all positive activation values at node i.
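A direct NumPy implementation of Eq. 4 from stored activations might look as follows; this is our reconstruction (assuming no bias in layer l), not the authors' code.

```python
import numpy as np

def mean_cosine_similarity(A_prev, A_curr, W_l, eps=1e-12):
    """Eq. 4: mu_cosine(l). A_prev is (N, d_prev), the activations of layer l-1;
    A_curr is (N, d_l), the ReLU activations of layer l; W_l is (d_prev, d_l).
    For active samples a_l^i equals w_l^i . a_{l-1}, so dividing by the two
    norms recovers the cosine similarity used in Eq. 3."""
    w_norms = np.linalg.norm(W_l, axis=0)            # ||w_l^i|| for every node i
    a_norms = np.linalg.norm(A_prev, axis=1)         # ||a_{l-1}|| for every sample
    cos = A_curr / (a_norms[:, None] * w_norms[None, :] + eps)
    active = A_curr > 0.0                            # per-node sample sets A_i
    per_node = (cos * active).sum(axis=0) / np.maximum(active.sum(axis=0), 1)
    alive = active.any(axis=0)                       # dead nodes are omitted
    return per_node[alive].mean()
```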
    Fig. 3 shows this metric for models trained on various amounts of noise. It can be observed that noise in either the inputs or the labels shifts the depth at which high mean cosine similarities are obtained deeper into the network, compared to the baseline models. Additionally, note that the cosine similarities for the structured input noise are more spread out across layers than for the other two types of noise, which is more consistent with the baseline model. Lastly, it is clear from the results of the models containing noise in the labels that a “convincing” representation of the training data is obtained at around layer 6. Very small cosine similarities are observed in the earlier layers of all models under any noise conditions.




      (a) label corruption    (b) Gaussian input corruption    (c) structured input corruption

Fig. 3: Mean cosine similarity per-layer at varying levels of three types of noise.
The horizontal axis represents the probability of any given training sample hav-
ing been corrupted for the relevant model. Results are measured on the training
samples of the MNIST data set. Equivalent results for FMNIST are included in
Appendix C.




4.3    Sample set corruption composition

The noise in the data sets is introduced probabilistically on a per sample basis
(see Alg. 1 to 3). This provides a convenient way to investigate the composition

of sample sets with regards to noise. Fig. 4 shows how sample sets can consist
of different ratios of true and corrupted sample information.




Fig. 4: Per class sample set corruption ratios for the first hidden layer of a 3x100
MLP fitting MNIST training examples, including structured input corruptions
at a probability of 0.5. The nodes have been arranged in descending order of
sample set size. The true and corrupted portions of the sample sets are presented
in green and red, respectively.


    We define the polarization of a node for a class as the extent to which the sample set of a node i favours either the corrupted or the uncorrupted samples of a class c. This is defined as follows:

\[ \mathrm{polarization}(c, i) = 2 \left| \frac{1}{2} - \frac{|\tilde{A}_i^c|}{|A_i^c|} \right| \tag{5} \]

where A_i^c is the set of all positive activation values at node i in response to samples from class c, and Ã_i^c is the corresponding set limited to corrupted samples. By averaging over nodes and classes, a per-layer mean polarization value can be obtained with the following equation:

\[ \mu_{\mathrm{polarization}}(l) = \frac{1}{|K| \, |N_l|} \sum_{c \in K,\; i \in N_l} \mathrm{polarization}(c, i) \tag{6} \]

where K is a set of all classes. Dead nodes (that are not activated for any sample)
are omitted when performing the averaging operations. A dead node does not
contribute to the global function and merely reduces the dimensionality of the
feature space in the corresponding layer by 1.
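Under the same assumptions as before, Eq. 5 and 6 can be computed from the layer activations, the class labels, and the known corruption mask; the NumPy sketch below is illustrative rather than the authors' implementation.

```python
import numpy as np

def mean_polarization(A_curr, labels, corrupted, num_classes):
    """Eq. 5 and 6: mean per-layer polarization. A_curr is (N, d_l), labels is
    (N,) with class indices, corrupted is an (N,) boolean corruption mask.
    Nodes that are dead for a given class are omitted from the average."""
    active = A_curr > 0.0                                     # sample-set membership per node
    values = []
    for c in range(num_classes):
        in_c = labels == c
        A_ci = active[in_c].sum(axis=0)                       # |A_i^c| per node
        A_ci_tilde = active[in_c & corrupted].sum(axis=0)     # |Ã_i^c| per node
        alive = A_ci > 0
        values.append(2.0 * np.abs(0.5 - A_ci_tilde[alive] / A_ci[alive]))  # Eq. 5
    return np.concatenate(values).mean()                      # Eq. 6
```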

    The polarization metric indicates how much the sample sets formed within a layer are polarized between true class information and corrupted class information, for any given class. The relevant polarization values are provided in Fig. 5. The main observation is that sample sets tend to strongly favour either true or corrupted sample information, and this is especially so in the later layers of any given model. The label and structured input corruptions produce lower polarization earlier in the model, but this is to be expected, seeing as the input data has strong coherence based on correlations in class-related input structures. These findings support the second postulate in Section 4.1. It appears that sub-regions in the function space are dedicated to processing different kinds of training data.




      (a) label corruption    (b) Gaussian input corruption    (c) structured input corruption

Fig. 5: Mean per-layer corruption polarization at varying levels of three types
of noise. The horizontal axis represents the probability of any given training
sample having been corrupted for the relevant model. Results are measured on
the training samples of the MNIST data set. Equivalent results for FMNIST are
included in Appendix C.




4.4    Discussion
In this section we have shown that several overparameterized DNNs with no explicit regularization are able to generalize well despite evident spurious input-output relationships in the training data. We have used empirically evaluated metrics to show that, in the presence of noise, well-separated per-node sample sets are generated later in the network compared to baseline cases with no noise. Additionally, these well-separated sets of samples are highly polarized between samples containing true task information and samples without it.
    If we accept the viewpoint that nodes in hidden layers act as collaborating
feature differentiators (separating samples based on feature criteria that are
unique to each node) to generate informative feature spaces, then each layer

also acts as a mixture model fitting samples based on their representation in
the preceding layer. A key insight is that each model component is not fitted
to all samples in the data set. Model components (referring to a node and its
corresponding weight vector) are optimized on a specific subset of the population
as determined by the activation patterns in the preceding layer. As we have observed, these subpopulations tend to be composed predominantly of either true or false task information.
     In this sense, some model components of the network are dedicated to cor-
rectly classifying uncorrupted samples, and others are dedicated to corrupted
samples. To generalize this observation to training scenarios without explicit
data corruption, it can be observed that in most data sets samples from a spe-
cific class have varied representations. Without defining some representations as
noise, they are still processed in the same way the structured input corruption
data is processed in this paper, hence the strong coherence between the base-
line models and those containing structured input noise. This is also why it is
possible to perform some types of multitask learning. One example would be
combining MNIST and FMNIST. In this scenario the training set will contain
120 000 examples with consistent training labels, but two distinct representa-
tions in the input space. For example, class 6 will be correctly assigned to the
written number 6 and a shirt.
     To summarise, DNNs do overfit on noise, albeit benignly. The vast represen-
tational capacity and non-linearities enable subcomponents of the network to be
dedicated to processing subpopulations of the training data. When out-of-sample data is processed, the regions most similar to the unseen data are used to make predictions, thereby preventing the model components fitted to noise from affecting generalization.


5   Conclusion
We have investigated the phenomenon that DNNs are able to generalize in spite
of perfectly fitting noisy training data. An interesting mechanism that is intrinsic to non-linear function approximation in deep MLPs has been described, and supporting empirical analyses have been provided. The findings in this paper
suggest that good generalization in large DNNs, in spite of extreme noise in
the training data, is a result of the modular way training samples are fitted
during optimization. Future work will attempt to construct a formal framework
with which to characterize the collaborating sub-components and, based on this,
possibly produce some theoretically grounded predictors of generalization.


6   Acknowledgements
This work was partially supported by the National Research Foundation (NRF,
Grant Number 109243). Any opinions, findings, conclusions or recommendations
expressed in this material are those of the authors and the NRF does not accept
any liability in this regard.

References
 1. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C.,
    Chayes, J.T., Sagun, L., Zecchina, R.: Entropy-SGD: Biasing gradient descent into
    wide valleys. ArXiv abs/1611.01838 (2016)
 2. Davel, M.H., Theunissen, M.W., Pretorius, A.M., Barnard, E.: DNNs as layers of
    cooperating classifiers. In: AAAI 2020 (submitted for publication)
 3. Dinh, L., Pascanu, R., Bengio, S., Bengio, Y.: Sharp minima can generalize for
    deep nets. arXiv preprint arXiv:1703.04933v2 (2017)
 4. Elsayed, G.F., Krishnan, D., Mobahi, H., Regan, K., Bengio, S.: Large margin deep
    networks for classification. In: NeurIPS (2018)
 5. Gale, T., Elsen, E., Hooker, S.: The state of sparsity in deep neural networks.
    ArXiv abs/1902.09574 (2019)
 6. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: AIS-
    TATS (2011)
 7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
 8. Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Computation 9, 1–42 (1997)
 9. Jastrzebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., Storkey, A.J.: On
    the relation between the sharpest directions of DNN loss and the SGD step length.
    In: ICLR (2018)
10. Jiang, Y., Krishnan, D., Mobahi, H., Bengio, S.: Predicting the generalization
    gap in deep networks with margin distributions. arxiv preprint (In ICLR 2019)
    arXiv:1810.00113v2 (2019)
11. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. ArXiv
    abs/1710.05468 (2017)
12. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-
    batch training for deep learning: Generalization gap and sharp minima. ArXiv
    abs/1609.04836 (2016)
13. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
    (In ICLR 2014) arXiv:1412.6980 (2014)
14. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
    document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
15. Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T.: Visualizing the loss landscape
    of neural nets. In: NeurIPS (2017)
16. Loroch, D.M., Pfreundt, F.J., Wehn, N., Keuper, J.: Sparsity in deep
    neural networks - an empirical investigation with tensorquant. In:
    DMLE/IOTSTREAMING@PKDD/ECML (2018)
17. Maurer, A., Pontil, M.: Structured sparsity and generalization. Journal of Machine
    Learning Research 13 (2011)
18. Montúfar, G., Pascanu, R., Cho, K., Bengio, Y.: On the number of linear regions
    of deep neural networks. ArXiv abs/1402.1869 (2014)
19. Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S.,
    Mitliagkas, I.: A modern take on the bias-variance tradeoff in neural networks.
    ArXiv abs/1810.08591 (2018)
20. Novak, R., Bahri, Y., Abolafia, D.A., Pennington, J., Sohl-Dickstein, J.: Sensi-
    tivity and generalization in neural networks: an empirical study. In: International
    Conference on Learning Representations (ICLR) (2018)
21. Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., Sohl-Dickstein, J.: On the ex-
    pressive power of deep neural networks. In: Proceedings of the 34th International
    Conference on Machine Learning. pp. 2847–2854 (2017)

22. Vapnik, V.N.: An overview of statistical learning theory. IEEE Transactions on
    Neural Networks and Learning Systems (1999)
23. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for bench-
    marking machine learning algorithms. arXiv preprint arXiv:1708.07747v2 (2017)
24. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep
    learning requires rethinking generalization. arXiv preprint (In ICLR 2017)
    arXiv:1611.03530 (2016)


A    Experimental setup
The classification data sets that are used for empirical investigations are MNIST
[14] and FMNIST[23]. These data sets are drop-in replacements for each other,
meaning that they contain the same number of input dimensions, classes, and
examples. Namely, 28x28 (784 flattened), 10, and 70 000, respectively. In all
training scenarios a random split of 55 000 training examples, 5 000 validation
examples, and 10 000 evaluation examples are used. The generalization gap is
calculated by subtracting the evaluation accuracy from the training accuracy.
    All models are trained for 400 epochs with randomly selected batches of 128 examples. This amounts to 171 600 parameter updates in total. In order to ensure that the training data is completely fitted, this is repeated for at least two random uniform parameter initializations, and the model that best fits the training data is selected for analysis. A fixed MLP architecture containing 10 hidden layers of 512 nodes each is used, with a single bias node at the first layer. ReLU activation functions are used at all hidden layers. The popular Adam [13] optimizer is used with a simple mean squared error (MSE) loss function at the output. No output activation function is employed, and no additional regularizing techniques, such as batch normalization, weight decay, dropout, data augmentation, or early stopping, are implemented.
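For concreteness, a minimal PyTorch sketch of the described setup follows. This is our reconstruction from the text rather than the authors' code; the data loading and corruption steps are omitted, and the use of one-hot targets is an assumption consistent with the stated MSE loss and the absence of an output activation.

```python
import torch
import torch.nn as nn

def build_mlp(input_dim=784, num_classes=10, width=512, depth=10):
    """10 hidden ReLU layers of 512 nodes; bias only at the first layer;
    linear output with no activation function."""
    layers = [nn.Linear(input_dim, width, bias=True), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width, bias=False), nn.ReLU()]
    layers += [nn.Linear(width, num_classes, bias=False)]
    return nn.Sequential(*layers)

def train(model, loader, epochs=400, lr=1e-3):
    """Adam with an MSE loss on one-hot targets; no explicit regularization."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:                      # batches of 128 samples
            targets = torch.nn.functional.one_hot(y, 10).float()
            opt.zero_grad()
            loss = loss_fn(model(x.view(x.size(0), -1)), targets)
            loss.backward()
            opt.step()
    return model
```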

B    Algorithms for adding noise
The first type of noise is identical to the “Partially corrupted labels” in [24], except for the selection of alternative labels. Instead of selecting a random label uniformly, we select a random alternative (not including the true label) uniformly. This results in a data set whose actual corruption level is closer to the probability value P that determines it. See Algorithm 1. The second type of noise is similar to the “Gaussian” noise in [24], with the difference being that instead of replacing input data with Gaussian noise drawn from a variable with mean and variance identical to those of the data set, we determine the mean and variance of the Gaussian from the specific sample being corrupted. See Algorithm 2. The third type of noise replaces input data with alternatives that are completely different from any in the true data set but still structured in a way that is predictable. This is accomplished by rotating the sample 90° counter-clockwise about the centre, followed by an inversion of the feature values. Inversion refers to subtracting each feature value from the maximum feature value in the sample. See Algorithm 3.


 Algorithm 1: Label corruption
   Input: A train set of labelled examples (X^(train), Y^(train)), a set of possible
          labels C, and a probability value P
   Output: A train set of corrupted examples (X^(train), Ŷ^(train))
 1 for y in Y^(train) do
 2     if B(1, P) then
 3         ŷ ∼ U[C \ {y}]
 4     else
 5         ŷ ← y




 Algorithm 2: Gaussian input corruption
   Input: A train set of labelled examples (X^(train), Y^(train)), a set of possible
          labels C, and a probability value P
   Output: A train set of corrupted examples (X̂^(train), Y^(train))
 1 for x in X^(train) do
 2     if B(1, P) then
 3         x̂ ← g
 4         where g is a vector sampled from N[µ_x, σ_x]
 5     else
 6         x̂ ← x




 Algorithm 3: Structured input corruption
   Input: A train set of labelled examples (X^(train), Y^(train)), a set of possible
          labels C, and a probability value P
   Output: A train set of corrupted examples (X̂^(train), Y^(train))
 1 for x in X^(train) do
 2     if B(1, P) then
 3         x̂ ← invert(rotate(x))
 4     else
 5         x̂ ← x
 6 rotate is a 90° rotation counter-clockwise about the origin
 7 invert is an inversion of all values in the vector
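A NumPy sketch of Algorithms 1 to 3 as we read them (the random seed, function names, and the 28x28 reshape are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(Y, num_classes, P):
    """Alg. 1: with probability P, replace a label with a uniformly chosen
    alternative, excluding the true label."""
    Y_hat = Y.copy()
    for n in range(len(Y)):
        if rng.random() < P:
            Y_hat[n] = rng.choice([c for c in range(num_classes) if c != Y[n]])
    return Y_hat

def corrupt_gaussian(X, P):
    """Alg. 2: with probability P, replace a sample's features with Gaussian
    noise matching that sample's own mean and standard deviation."""
    X_hat = X.copy()
    for n in range(len(X)):
        if rng.random() < P:
            X_hat[n] = rng.normal(X[n].mean(), X[n].std(), size=X[n].shape)
    return X_hat

def corrupt_structured(X, P, side=28):
    """Alg. 3: with probability P, rotate the image 90 degrees counter-clockwise
    and invert its values (maximum feature value minus each feature value)."""
    X_hat = X.copy()
    for n in range(len(X)):
        if rng.random() < P:
            img = np.rot90(X[n].reshape(side, side))   # counter-clockwise rotation
            X_hat[n] = (img.max() - img).reshape(-1)   # inversion
    return X_hat
```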




C    Additional results

The mean cosine similarities and mean polarization values for analyses conducted
on the FMNIST data set are provided in Fig. 6 and 7, respectively. Notice that most of the same observations can be made as for the MNIST results in Sections 4.2 and 4.3. It is, however, worth noting that for a classification task

with more overlap in the input space, such as FMNIST, the well-separated sample sets are generated at even later layers and the polarization is higher overall.




      (a) label corruption    (b) Gaussian input corruption    (c) structured input corruption

Fig. 6: Mean cosine similarity per-layer at varying levels of three types of noise.
The horizontal axis represents the probability of any given training sample hav-
ing been corrupted for the relevant model. Results are measured on the training
samples of the FMNIST data set.




      (a) label corruption    (b) Gaussian input corruption    (c) structured input corruption

Fig. 7: Mean per-layer corruption polarization at varying levels of three types
of noise. The horizontal axis represents the probability of any given training
sample having been corrupted for the relevant model. Results are measured on
the training samples of the FMNIST data set.