A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets

Jiří Tumpach 1,2, Jan Kalina 2, and Martin Holeňa 2
1 Charles University, Faculty of Mathematics and Physics, Prague
2 The Czech Academy of Sciences, Institute of Computer Science, Prague
{tumpach,kalina,martin}@cs.cas.cz

Abstract: Neural networks are frequently used as regression models. Their training is usually difficult when the model is subject to a small training dataset with numerous outliers.
This paper investigates the effects of various regularisation techniques that can help with this kind of problem. We analysed the effects of the model size, loss selection, L2 weight regularisation, L2 activity regularisation, Dropout, and Alpha Dropout.
We collected 30 different datasets, each of which has been split by ten-fold cross-validation. As an evaluation metric, we used cumulative distribution functions (CDFs) of L1 and L2 losses to aggregate results from different datasets without a considerable amount of distortion. Distributions of the metrics are shown, and thorough statistical tests were conducted.
Surprisingly, the results show that Dropout models are not suited for our objective. The most effective approach is the choice of model size and L2 types of regularisations.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural networks are nature-inspired regression models increasingly important in machine learning. This type of model excels in predictive power, but it has a poor robustness to outliers if the training dataset has a small number of samples, the target function is complicated, or the network is over-parametrized for the problem [1, 2, 3].

On the other hand, novel theoretical analyses show a different perspective on neural networks. In [4, 5] the authors investigate the effect of priors over weights for an infinitely wide single-layer neural network and show that a Gaussian prior results in a Gaussian process prior over its functions. The Gaussian process is a smooth non-parametric model well known for its generalisation properties, so it leads to the conjecture that there is no need to avoid over-fitting of such a network. That idea was further generalised to two-layer neural networks in [6] and to general deep neural networks in [7]. Experiments in [7] show that finite-width neural networks approach their infinite counterparts as their width increases. The authors of [7] further pointed out that Dropout could be an interesting potential improvement.

In this paper we are interested in these areas where the network should struggle, because we intend to use neural networks as approximations for surrogate modeling in black-box optimisation. Surrogate models are local models that estimate an unknown function in order to select better candidates for evaluation and in this way reduce the cost of optimisation.

That motivated the research reported in this paper, in which we have investigated 5 different regularisation techniques in different configurations on 30 different datasets.
2 Methods

2.1 Regularisation

Regularisation is a broad term used for methods that add some new prior belief [2, 3] to a specific machine learning method. The belief should redefine the problem to achieve a solution modified in the sense described by Occam's razor principle [8, 9] – more complex hypotheses are less likely than simple hypotheses. For example, the lasso/ridge shrinkage methods in linear regression add a new term that makes small values more suitable as the solution to a problem [2].

The network size regularisation
As with many other regression models, the number of free parameters has critical consequences [3]. Models with a small number of free parameters can handle only simple relationships, while large models can be more flexible. On the other hand, a large model needs more samples in order to achieve more reliable predictions for all parameters. If that requirement is not met, the regression can over-fit – the model finds some non-sensible but possible relationships in the training dataset, which are not valid in general.

Weight regularisation
One possible solution to the free parameters problem is a restriction of parameter domains. In neural networks, it is done using weight regularisation. In fact, the domains of the parameters are unchanged, but the probability of larger values is strongly reduced because of an alteration of the optimisation objective. For example, the L2 type of weight regularisation adds a new term L_{w2} to the loss of a particular network. It is defined as

    L_{w2} = \sum_i w_i^2,    (1)

where w_i stands for the value of the i-th parameter. It is usually applied only to weights, not to biases.

Activity regularisation
The third type of regularisation is the activity regularisation [10]. In short, it penalises big values coming from neurons. The effect may seem similar to weight regularisation, but it may have more potential in cases where the size of a layer is large enough. The reason is that the weighted activities could add up to large numbers in spite of small values of the weights. Activity regularisation is a way of making the input information denser, which is a nice property that is commonly utilised in autoencoder-type neural networks [3, 10].

Dropout
The Dropout technique essentially mimics the bagging technique, which is regularly used for improving the generalisation of multiple models by aggregating their results [11, 12, 3].

When Dropout is applied to a specific layer, the training and testing phases differ. When the model is in the training stage, the outputs are randomly dropped – replaced with zeros. Therefore the next layer is forced to adapt to this incomplete information. In the testing phase, the random sampling is replaced by a multiplicative constant in order to maintain the mean values of activation for the next layer (an alternative is to use the constant in the training phase).

Consequently, it increases the robustness of the model and does not require any other model to train. The main difference compared with bagging is that the models in Dropout are dependent – they share weights. Such a sharing is illustrated in Figure 1.

Figure 1 (diagram omitted): Dropout regularisation. The red circle depicts the neuron where Dropout regularisation causes the output to be masked – in the current iteration, the output is set to zero, and all following layers are computed as normal. This effect causes the updates of immediate incoming connections of that neuron to be zero, but other updates can still modify all other previous weights through non-dropped neurons. In this case, all red edges represent changes brought by gradients from other neurons that may influence the dropped neuron in the following iterations.

Alpha Dropout
Standard Dropout is suited for rectified linear units because zero is the default value of this activation [11]. Alpha Dropout is a slight modification for smoother activation functions. It deals not only with the mean, but also with the variance. It is based on maintaining a running average of neurons' outputs and scaling them accordingly.
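As an illustration of how these regularisers are typically configured, the following minimal sketch (ours, not the authors' code) attaches all four of them to one dense layer, assuming the tf.keras API; the layer width, penalty weights and dropout rate are placeholders only.

    # A minimal sketch (ours), assuming the tf.keras API: how the regularisers
    # discussed above are typically attached to a single dense layer. The
    # width, penalty weights and dropout rate are illustrative placeholders.
    from tensorflow.keras import layers, regularizers

    def regularised_block(units=64, l2_weight=0.01, l2_activity=0.01,
                          drop_rate=0.5, use_alpha_dropout=False):
        dense = layers.Dense(
            units,
            activation="selu",
            # Eq. (1): adds l2_weight * sum_i(w_i^2) to the loss; biases excluded.
            kernel_regularizer=regularizers.l2(l2_weight),
            # Activity regularisation penalises large neuron outputs instead.
            activity_regularizer=regularizers.l2(l2_activity),
        )
        if use_alpha_dropout:
            # Alpha Dropout preserves mean and variance of SELU activations.
            drop = layers.AlphaDropout(drop_rate)
        else:
            # Standard Dropout replaces outputs with zeros during training only.
            drop = layers.Dropout(drop_rate)
        return [dense, drop]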
2.2 Loss functions

A crucial part of any machine learning model selection is the definition of a loss function (prediction error measure, performance measure) [13, 2]. The loss function should be fast, convex, and should match the random noise that can be found in the data. Frequently, the Mean Absolute Error (MAE) and Mean Square Error (MSE) functions are selected because the corresponding noise is additive and generated by the Laplacian and Gaussian distribution, respectively. In addition, also the Huber loss function (cf. [14]) is commonly used. These three loss functions are defined as:

    MSE(D) = \frac{1}{2|D|} \sum_{i=1}^{|D|} (y_i - \hat{y}_i)^2,    (2)

    MAE(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |y_i - \hat{y}_i|,    (3)

    Huber(D) = \frac{1}{2|D|} \sum_{i=1}^{|D|} \min\big( (y_i - \hat{y}_i)^2, \; 2|y_i - \hat{y}_i| - 1 \big),    (4)

where D is the dataset on which the loss is calculated, |D| is its size, y_i is the target value of the i-th sample, and \hat{y}_i is its prediction.

It is common to assume that the dataset is outlier-free and normally distributed; therefore the MSE is the first choice regarding the selection of the loss function. MAE/Huber losses are good replacements whenever the data are known to have outliers, or the MSE has not performed well for an unknown reason.
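For concreteness, the following minimal NumPy sketch (ours, not the authors' code) evaluates the three losses exactly as written in Eqs. (2)-(4), including the 1/(2|D|) scaling used above; y and y_hat are one-dimensional arrays of targets and predictions.

    # A minimal NumPy sketch (ours) of the losses in Eqs. (2)-(4).
    import numpy as np

    def mse(y, y_hat):
        # Eq. (2): (1 / (2|D|)) * sum_i (y_i - y_hat_i)^2
        return np.sum((y - y_hat) ** 2) / (2 * len(y))

    def mae(y, y_hat):
        # Eq. (3): (1 / |D|) * sum_i |y_i - y_hat_i|
        return np.mean(np.abs(y - y_hat))

    def huber(y, y_hat):
        # Eq. (4): quadratic for small residuals, linear for large ones.
        r = np.abs(y - y_hat)
        return np.sum(np.minimum(r ** 2, 2.0 * r - 1.0)) / (2 * len(y))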
Robust loss functions
A common way of dealing with outliers is to remove them from the training data or to choose an entirely different model, such as k-NN or a regression tree [2].

Even though outlier removal has been thoroughly studied, the exact definition of an outlier highly depends on the problem we want to solve. There exist definitions of an outlier relying on the median absolute deviation [15], quantiles and medoids [16], an online Kalman filter [17], or nearest neighbour based filtering [18].

A different approach is proposed in [19] and improved in [20], where the authors deal with robust linear regression by removing the most prominent residuals in the loss function. That idea was further adapted for neural networks in [21] and for nonlinear regression with a known regression function in [22]. Essentially, these methods exploit the idea that neural networks can learn algorithms (hypotheses). With an assumption that more complex algorithms are harder to learn, the prior belief that reduces the probability of more complex hypotheses also serves as an outlier removal tool.

Extensions
Least Trimmed Squares (LTS) and Least Trimmed Absolute Deviations (LTA) are extensions of the MSE (2) and MAE (3) that follow the approach recalled in the previous paragraph. We used them in our analysis; they are defined in the following way:

    LTS(D) = \frac{1}{0.9|D|} \sum_{i=1}^{|D|} \rho\big( (y_i - \hat{y}_i)^2 \big),    (5)

    LTA(D) = \frac{1}{0.9|D|} \sum_{i=1}^{|D|} \rho\big( |y_i - \hat{y}_i| \big),    (6)

where \rho(x_i) = x_i if x_i belongs to the 90 % smallest residuals, and \rho(x_i) = 0 otherwise.
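A minimal NumPy sketch (ours, not the authors' code) of the trimmed losses in Eqs. (5) and (6) follows: rho zeroes out the residuals above the 90 % quantile before the sum is divided by 0.9|D|; the keep ratio appears as a parameter only for illustration.

    # A minimal NumPy sketch (ours) of the trimmed losses in Eqs. (5) and (6).
    import numpy as np

    def _rho(residuals, keep=0.9):
        # Keep a residual only if it belongs to the `keep` smallest ones.
        cutoff = np.quantile(residuals, keep)
        return np.where(residuals <= cutoff, residuals, 0.0)

    def lts(y, y_hat, keep=0.9):
        # Eq. (5): least trimmed squares.
        return np.sum(_rho((y - y_hat) ** 2, keep)) / (keep * len(y))

    def lta(y, y_hat, keep=0.9):
        # Eq. (6): least trimmed absolute deviations.
        return np.sum(_rho(np.abs(y - y_hat), keep)) / (keep * len(y))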
3 Methodology

3.1 Datasets and their preparation

We selected 30 datasets containing a relatively small number of samples. These are real-world as well as artificially generated publicly available datasets, for which a nonlinear regression model (i.e. explaining a given variable as a response against predictors under uncertainty) is a meaningful task. The list of the 30 datasets is presented in Table 1. Only datasets without missing values were selected.

A ten-fold cross-validation has been employed in order to obtain more reliable results. If a dataset had fewer than ten samples, we used leave-one-out cross-validation instead. Each feature was standardised according to the training data in a specific fold.

Table 1: All datasets considered in the analysis.

    Name                             no. features   no. samples
    Concrete Compressive Strength    9              1030
    The Boston Housing               14             506
    Auto MPG                         8              398
    Proben1 (3d reg.)                6              4208
    Misra1a                          2              14
    Chwirut2                         3              54
    Chwirut1                         3              214
    Lanczos3                         6              24
    Gauss1                           8              250
    Gauss2                           8              250
    DanWood                          2              6
    Kirby2                           5              151
    Hahn1                            7              236
    Nelson                           3              128
    MGH17                            5              33
    Lanczos1                         6              24
    Lanczos2                         6              24
    Gauss3                           8              250
    Roszman1                         4              25
    ENSO                             9              168
    MGH09                            4              11
    Thurber                          7              37
    BoxBOD                           2              6
    Rat42                            3              9
    MGH10                            3              16
    Eckerle4                         3              35
    Rat43                            4              15
    Bennett5                         3              154

Table 2: Options for the first set of experiments.

    Hyperparameter            Considered values
    Loss                      MSE, MAE, Huber, LTS, LTA
    Size of the first layer   4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
    Model regularisation      No regularisation, 50% Dropout, 50% Alpha Dropout

Table 3: Options for the second set of experiments.

    Hyperparameter            Considered values
    Loss                      MSE, MAE, Huber
    Size of the first layer   4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
    Model regularisation      L2-weight – 0.001, 0.01, 0.1, 1, 10;
                              L2-activity – 0.001, 0.01, 0.1, 1, 10;
                              Alpha dropout – 0.1, 0.2, 0.4, 0.6, 0.8

3.2 Aggregation of results

It is not possible to directly visualize the results of regression methods across multiple datasets and loss functions. For example, some datasets are easier than others, and one loss function highlights outliers more, so most of the loss is made of one sample.

We tackle this problem by placing each specific dataset, fold, and loss function in a separate bin. In this bin, we learn the order of the results, creating an empirical cumulative distribution function (ECDF). Every result in a specific bin was mapped by the corresponding ECDF, creating a normalized order of results in that particular bin. Finally, all results are combined back together.

To compare normalized results for a specific hyperparameter, we split the combined results by its value. These splits create empirical distributions of normalized results, which a violin plot can reasonably visualize.
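The following pandas sketch (ours, not the authors' code) illustrates this aggregation: within every (dataset, fold, loss) bin, the raw losses are mapped through the bin's ECDF, so each result becomes a normalised rank before the bins are merged. The column names are hypothetical.

    # A minimal pandas sketch (ours) of the aggregation described in Section 3.2.
    import pandas as pd

    def ecdf_normalise(df: pd.DataFrame) -> pd.DataFrame:
        """df columns (hypothetical): dataset, fold, loss_name, configuration, raw_loss."""
        def map_bin(bin_df):
            bin_df = bin_df.copy()
            # ECDF value of each raw loss within its own bin.
            bin_df["normalised"] = (bin_df["raw_loss"].rank(method="average")
                                    / len(bin_df))
            return bin_df
        return (df.groupby(["dataset", "fold", "loss_name"], group_keys=False)
                  .apply(map_bin))

    # Splitting the "normalised" column by a hyperparameter value then yields
    # the empirical distributions visualised as violin plots in the figures.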
3.3 Statistical tests

We have used only non-parametric blocking statistical tests because the results have limited values, and we wanted to utilise as much information as possible. The Friedman test was used to decide whether the results included in a particular view on some hyperparameter are drawn from the same distribution or not. At this point, the ECDF mapping is not needed because the test is non-parametric. All statistical tests use the usual 5% significance level.

Multiple comparison tests were done using the Wilcoxon signed-rank test [23] with Holm correction [24] instead of mean-ranks post-hoc tests, which can create inconsistencies and paradoxical situations in machine learning scenarios [25].

3.4 Architecture

We selected a three-layered architecture. The first layer has T neurons, the second layer always has T/2, and the third layer always has one neuron. The first two layers have Scaled Exponential Linear Units (SELU) as the activation function, and the third layer has a linear activation function.

3.5 Training

We trained our models with the NAdam optimiser with a 0.001 learning rate. We use early stopping with patience = 10 and delta = 1e-10 to speed up the training. Even though this is another type of regularisation, we use it in such a manner that its effect is minuscule. The maximal number of epochs is set to 10000, and the batch size is equivalent to the size of the largest dataset. In the first set of experiments, we produced 48 014 neural networks and their results. In the second set, we managed to prepare 178 092 models.
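The following sketch (ours, not the released experiment code) puts Sections 3.4 and 3.5 together, assuming the tf.keras API; the loss argument would be one of the functions from Section 2.2, and the batch size here is simply the training set size as a stand-in for the largest dataset.

    # A minimal sketch (ours) of the architecture and training setup, assuming tf.keras.
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(n_features, T=64):
        # Three layers: T and T/2 SELU units, then a single linear output.
        return tf.keras.Sequential([
            layers.Dense(T, activation="selu", input_shape=(n_features,)),
            layers.Dense(T // 2, activation="selu"),
            layers.Dense(1, activation="linear"),
        ])

    def train(model, x_train, y_train, x_val, y_val, loss="mse"):
        model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001),
                      loss=loss)
        early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                                      min_delta=1e-10)
        # The paper uses a batch size equal to the size of the largest dataset,
        # so the training effectively runs in full-batch mode.
        model.fit(x_train, y_train, validation_data=(x_val, y_val),
                  epochs=10000, batch_size=len(x_train),
                  callbacks=[early_stop], verbose=0)
        return model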
4 Results

4.1 Dropout regularisation

In the first experiment, we compared three settings – no regularisation, Dropout regularisation, and Alpha Dropout regularisation. Both Dropout techniques are set to a 50% probability. The results are in Figure 2; the number of models that are better than the same hyperparameter counterpart can be seen in Figure 2 (b) for the L1 loss and Figure 2 (c) for the L2 loss. We highlighted in bold the values that the Wilcoxon signed-rank test with Holm correction found significantly better than the column value.

It seems that the regularisation does not help. It may be caused by the exaggerated value of the Dropout rate or a need for such models to have wider layers. We do not know the reason why Alpha Dropout performed so badly – it should be better because we used SELU as the activation function.

One possible explanation for this poor performance is the use of early stopping. In the training phase, the dropout causes the output to be stochastic, so the error is stochastic too. The stochastic error can cause accidental results, which can stop the training prematurely. Because we have one mini-batch, the variance is too significant not to be perceptible.

Though not tried in our experiments, a possible remedy could be an early stopping variation where the error is exponentially smoothed.

Figure 2 (violin plots omitted): The first experiment compared different kinds of dropout regularizers across all different combinations of datasets and hyperparameters. Panel (a): distributions of scaled results by empirical cumulative distribution functions for each dataset and loss separately. Panels (b) and (c), reproduced below, display the number of experiments where one regularisation setting (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold. The statistical tests clearly prefer models without dropout.

    (b) Statistical tests for the L1 loss function
                     No reg.   Dropout   Alpha Dropout
    No reg.          -         10898     14070
    Dropout          4557      -         12360
    Alpha Dropout    1385      3095      -

    (c) Statistical tests for the L2 loss function
                     No reg.   Dropout   Alpha Dropout
    No reg.          -         10904     14020
    Dropout          4551      -         12352
    Alpha Dropout    1435      3103      -

4.2 Size of models

In the second and third experiments, we were interested in the network size and its effect on performance. Figure 3a and Table 4 show non-regularised models, and Figure 3b and Table 5 show the Dropout variants combined together. Non-regularized results are better than the Dropout variants, which are less stable and have a delayed response to the increase of network size.

The lower stability may come from the same source as the previous problem – the early stopping could make the model undertrained. The delay may be the result of the selected dropout rate. Because we used a dropout rate of 50%, the real amount of usable information can be effectively halved in each hidden layer (given that there is no space or resources to make the information denser). Together it is a 4x delay, which is not enough to explain the findings (the optimum size of the model is 2^4 vs. 2^7). Possible other reasons could be

• the difficulty of encoding uncertain patterns,
• undertraining due to early stopping.

Figure 3 (violin plots omitted): The second and third experiment analyse the effect of model size on regularized and non-regularized models; panel (a) shows distributions of test losses for non-regularized models, panel (b) for Dropout and Alpha Dropout models combined together. The regularisation delays over-training but does not improve the results.

Table 4: The second experiment compares the L1 loss for models of different sizes without regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            4     8     16    32    64    128   256   512   1024  2048  4096
    4       -     580   518   604   683   767   863   975   1047  1120  1177
    8       825   -     636   707   777   827   921   1045  1129  1213  1238
    16      887   769   -     797   859   871   978   1096  1184  1253  1275
    32      801   698   608   -     818   878   999   1122  1197  1255  1270
    64      722   628   546   587   -     885   1046  1148  1222  1268  1278
    128     638   578   534   527   520   -     1031  1164  1229  1263  1282
    256     542   484   427   406   359   374   -     999   1151  1189  1232
    512     430   360   309   283   257   241   406   -     1042  1075  1187
    1024    358   276   221   208   183   176   254   363   -     961   1072
    2048    285   192   152   150   137   142   216   330   444   -     973
    4096    228   167   130   135   127   123   173   218   333   432   -

Table 5: The third experiment compares the L1 loss for models of different sizes with Dropout or Alpha Dropout regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            4     8     16    32    64    128   256   512   1024  2048  4096
    4       -     1170  948   737   717   795   870   1030  1285  1682  2087
    8       1640  -     1089  802   759   849   926   1088  1412  1855  2197
    16      1862  1721  -     982   892   932   1024  1210  1654  2065  2373
    32      2073  2008  1828  -     1268  1255  1247  1490  1967  2335  2542
    64      2093  2051  1918  1542  -     1347  1355  1620  2111  2438  2588
    128     2015  1961  1878  1555  1463  -     1406  1740  2193  2453  2614
    256     1940  1884  1786  1563  1455  1404  -     1792  2280  2514  2623
    512     1780  1722  1600  1320  1190  1070  1018  -     2127  2417  2585
    1024    1525  1398  1156  843   699   617   530   683   -     2083  2411
    2048    1128  955   745   475   372   357   296   393   727   -     2050
    4096    723   613   437   268   222   196   187   225   399   760   -

4.3 Loss function

In the fourth experiment, we analyzed the effect of the loss function for models without regularisation. Trimmed variants performed poorly, probably because they remove some residuals (10%) and, therefore, reduce the dataset size even more. In our case, the Mean Squared Error (MSE) is a better fit than the Mean Absolute Error (MAE). From the distribution in Figure 4 it seems that MSE has much worse results, but the median value (shown as the white point in the central part of the graph) of MSE is better than that of MAE. The best loss function is the Huber loss. All results can be seen in Table 6.

The Huber loss combines the benefits of both worlds because its derivatives are dependent on the size of the error (from MSE) while limiting the maximum value (from MAE). This effect may be responsible for the best result among the considered loss functions.

Figure 4 (violin plots omitted): The results of the fourth experiment – comparison between loss functions for non-regularized models. MSE has a lot of good and bad results; Huber seems to be the best; trimmed losses are the worst.

Table 6: The fourth experiment compares L1 losses for models trained by optimization of different loss functions without regularization. The table displays the number of experiments where one loss function (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            Huber   MAE    MSE    LTA    LTS
    Huber   -       1854   1614   2111   2184
    MAE     1237    -      1341   2125   2171
    MSE     1477    1750   -      2006   2085
    LTA     980     966    1085   -      1750
    LTS     907     920    1006   1341   -

4.4 Weight regularisation

The weight regularisation has a prominent effect on the results, as revealed in Figure 5. Too much is certainly worse than no weight regularisation, but suitable values significantly reduce bad results.

If the regularisation is too high, the loss is effectively replaced only with the term that reduces the weights on the network's connections. If it is too low, the network can lack regularisation – creating potentially volatile responses.

Figure 5 (violin plots omitted): The fifth experiment exposes the effect of the L2 normalisation weight.

Table 7: The fifth experiment exposes the effect of the L2 normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.0    0.001  0.01   0.1    1.0    10.0
    0.0     -      4430   3705   5053   6260   8348
    0.001   5686   -      4364   5735   6987   8967
    0.01    6411   5752   -      7086   8278   9601
    0.1     5063   4381   3030   -      8325   9686
    1.0     3856   3129   1838   1790   -      9449
    10.0    1768   1149   515    430    647    -

4.5 Activity regularisation

In our case, the effect of activity regularisation is similar to, but smaller than, that of the weight penalty. The difference in weight and activity regularisation effectiveness can be explained by the specific activation used in training. The results are in Figure 6 and in Table 8.

Figure 6 (violin plots omitted): The sixth experiment exposes the effect of the L2 activity normalisation weight.

Table 8: The sixth experiment exposes the effect of the L2 activity normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.0    0.001  0.01   0.1    1.0    10.0
    0.0     -      5197   4895   4813   5787   7044
    0.001   4919   -      4921   4901   5833   7115
    0.01    5221   5195   -      5237   6352   7654
    0.1     5303   5215   4879   -      7313   8509
    1.0     4329   4283   3764   2803   -      8631
    10.0    3072   3001   2462   1607   1485   -

4.6 Alpha Dropout rate

In Figure 7 and Table 9 the effects of the Alpha Dropout rate can be seen. It may be good to investigate smaller values more because the 0.1 rate is the best. The preference for not having this regularisation can be explained in the same way as in Subsection 4.1.

Figure 7 (violin plots omitted): The seventh experiment exposes the effect of the Alpha Dropout rate (probability).

Table 9: The seventh experiment exposes the effect of the Alpha Dropout rate (probability). The table displays the number of experiments where one rate of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.1    0.2    0.4    0.6    0.8
    0.1     -      7226   7654   7614   7275
    0.2     2890   -      7075   6813   6335
    0.4     2462   3041   -      6137   5684
    0.6     2502   3303   3979   -      5449
    0.8     2841   3781   4432   4667   -
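The significance marks in Tables 4-9 come from one-sided Wilcoxon signed-rank tests with Holm correction. As an illustration only, the following sketch (ours, not the authors' evaluation code; the exact bookkeeping in the paper may differ) computes such pairwise decisions with SciPy, assuming a dictionary of paired, ECDF-normalised losses per setting, where lower is better and all arrays are aligned over the same bins.

    # A sketch (ours) of one-sided pairwise Wilcoxon tests with Holm correction.
    import numpy as np
    from scipy.stats import wilcoxon

    def pairwise_wilcoxon_holm(results, alpha=0.05):
        names = list(results)
        pairs, pvals = [], []
        for a in names:
            for b in names:
                if a == b:
                    continue
                # One-sided test: is setting `a` significantly better than `b`?
                _, p = wilcoxon(results[a], results[b], alternative="less")
                pairs.append((a, b))
                pvals.append(p)
        # Holm step-down: compare sorted p-values with alpha / (m - k).
        order = np.argsort(pvals)
        m = len(pvals)
        rejected = {pair: False for pair in pairs}
        for k, idx in enumerate(order):
            if pvals[idx] <= alpha / (m - k):
                rejected[pairs[idx]] = True
            else:
                break  # stop at the first hypothesis that cannot be rejected
        return rejected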
5 Conclusion

In this paper, we analyzed several types of regularisation techniques on databases where effective hyperparameter optimization is not possible due to the lack of samples or the existence of outliers in the database. We showed that Dropout techniques in these scenarios are not a good choice because their results are not stable enough to compete with models without regularisation. The model's size is an essential aspect, and it seems that the optimum has a far bigger number of free parameters than the theoretical number computed using the average across our training databases. The Huber loss function is the best because it does not suffer from the inconsistencies of the MAE or MSE losses. Trimmed variants of loss functions [21] performed poorly here, but they may be better if a particular dataset has more samples than we had. The third best hyperparameter to look for is the weight regularisation – a small weight dramatically reduces the frequency of bad results while keeping the median of the results low.

6 Acknowledgement

The research reported in this paper has been supported by SVV project number 260 575 and partially supported by the Czech Science Foundation (GA ČR) projects 18-18080S and 19-05704S.

Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

References

[1] Steve Lawrence, Clyde Lee Giles, and Ah Chung Tsoi. What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical report, Institute for Advanced Computer Studies, University of Maryland, 1996.
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2009.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[4] Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29-53. Springer, 1996.
[5] Christopher K. I. Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295-301. Morgan Kaufmann, 1997.
[6] Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
[7] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[8] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Occam's razor. Information Processing Letters, 24(6):377-380, 1987.
[9] Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Advances in Neural Information Processing Systems, pages 294-300. MIT Press, 2001.
[10] Jason Brownlee. Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions. Machine Learning Mastery, 2018.
[11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515, 2017.
[12] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
[13] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[14] P. J. Huber. Robust Statistics. Wiley, New York, 2nd edition, 2009.
[15] Frank R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383-393, 1974.
Publisher: Taylor & Francis. The research reported in this paper has been supported [16] Irad Ben-Gal. Outlier detection. In Data mining and knowl- by SVV project number 260 575 and partially supported edge discovery handbook, pages 131–146. Springer, 2005. by the Czech Science Foundation (GA ČR) projects 18- [17] Hancong Liu, Sirish Shah, and Wei Jiang. On-line outlier 18080S and 19-05704S. detection and data cleaning. Computers & Chemical Engi- Computational resources were supplied by the project neering, 28(9):1635–1647, 2004. "e-Infrastruktura CZ" (e-INFRA LM2018140) provided [18] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. within the program Projects of Large Research, Develop- Efficient Algorithms for Mining Outliers from Large Data ment and Innovations Infrastructures. Sets. In Proceedings of the 2000 ACM SIGMOD Inter- national Conference on Management of Data, SIGMOD ’00, pages 427–438, New York, NY, USA, 2000. Associa- tion for Computing Machinery. event-place: Dallas, Texas, USA. [19] Peter J Rousseeuw. Least median of squares regres- sion. Journal of the American statistical association, 79(388):871–880, 1984. Publisher: Taylor & Francis. [20] Peter Rousseeuw and Katrien Van Driessen. Computing LTS regression for large data sets. Data mining and knowl- edge discovery, 12(1):29–45, 2006. Publisher: Springer. [21] J. Kalina and P. Vidnerová. Robust Multilayer Perceptrons: Robust Loss Functions and Their Derivatives. In L. Iliadis, P. Angelov, C. Jayne, and E. Pimenidis, editors, Proceed- ings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference, pages 546–557, Cham, 2020. Springer. event-place: Cham. [22] J. Kalina, A. Neoral, and P. Vidnerová. Effective automatic method selection for nonlinear regression modelling. Inter- national Journal of Neural Systems, 2021. [23] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, 1945. Pub- lisher: JSTOR. [24] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pages 65– 70, 1979. Publisher: JSTOR. [25] Alessio Benavoli, Giorgio Corani, and Francesca Mangili. Should We Really Use Post-Hoc Tests Based on Mean- Ranks? Journal of Machine Learning Research, 17(5):1– 10, 2016.