<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Tumpach</string-name>
          <email>tumpach@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Kalina</string-name>
          <email>kalina@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics</institution>
          ,
          <addr-line>Prague</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Czech Academy of Sciences, Institute of Computer Science</institution>
          ,
          <addr-line>Prague</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Neural networks are frequently used as regression models. Their training is usually difficult when the model is subject to a small training dataset with numerous outliers. This paper investigates the effects of various regularisation techniques that can help with this kind of problem. We analysed the effects of the model size, loss selection, L2 weight regularisation, L2 activity regularisation, Dropout, and Alpha Dropout. We collected 30 different datasets, each of which has been split by ten-fold cross-validation. As an evaluation metric, we used cumulative distribution functions (CDFs) of L1 and L2 losses to aggregate results from different datasets without a considerable amount of distortion. Distributions of the metrics are shown, and thorough statistical tests were conducted. Surprisingly, the results show that Dropout models are not suited for our objective. The most effective approach is the choice of model size and L2 types of regularisations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Neural networks are nature-inspired regression models that are
increasingly important in machine learning. This type of
model excels in predictive power, but it has poor
robustness to outliers when the training dataset has a small number
of samples, the target function is complicated, or the
network is over-parametrized for the problem [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>
        On the other hand, novel theoretical analyses offer a
different perspective on neural networks. In [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] the authors
investigate the effect of priors over weights for an infinitely
wide single-layer neural network and show that a
Gaussian prior results in a Gaussian process prior over its
functions. The Gaussian process is a smooth non-parametric
model well known for its generalisation properties, which
leads to the conjecture that there is no need to avoid
overfitting of such a network. That idea was further generalised
to two-layer neural networks in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and general deep
neural networks in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Experiments in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] show that
finite-width neural networks approach their infinite counterparts
as their width increases. The authors of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] further
pointed out that Dropout could be an interesting potential
improvement.
      </p>
      <p>In this paper, we are interested in precisely those settings where the
network should struggle, because we intend to use
neural networks as approximators for surrogate modelling in
black-box optimisation. Surrogate models are local
models that estimate an unknown function in order to select
better candidates for evaluation and thereby reduce the
cost of optimisation.</p>
      <p>That motivated the research reported in this paper, in
which we have investigated 5 different regularisation
techniques in different configurations on 30 different datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Regularisation</title>
        <p>
          Regularisation is a broad term used for methods that add
some new prior belief [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] to a specific machine learning
method. The belief should redefine the problem to achieve
a solution modified in the sense described by Occam’s
razor principle [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]: more complex hypotheses are less
likely than simple ones. For example, the lasso and ridge
shrinkage methods in linear regression add a penalty term that
favours small coefficient values in the solution to a
problem [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          <bold>Network size regularisation.</bold> As with many other
regression models, the number of free parameters has critical
consequences [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Models with a small number of free
parameters can handle only simple relationships, while large
models are more flexible. On the other hand, a large
model needs more samples in order to estimate all of its
parameters reliably. If that requirement
is not met, the regression can over-fit: the model finds
spurious relationships that are possible in the
training dataset but are not valid in general.
        </p>
        <p><bold>Weight regularisation.</bold> One possible solution to the free
parameters problem is a restriction of the parameter domains.
In neural networks, this is done using weight regularisation.
Strictly speaking, the domains of the parameters are unchanged, but
the probability of larger values is strongly reduced because
the optimisation objective is altered. For
example, the L2 type of weight regularisation adds a new term
<italic>L</italic><sub>w2</sub> to the loss of a particular network. It is defined as
<disp-formula><label>(1)</label><tex-math><![CDATA[L_{w2} = \lambda \sum_{i} w_i^2 ,]]></tex-math></disp-formula>
where <italic>w</italic><sub>i</sub> stands for the value of the <italic>i</italic>-th parameter and λ is the regularisation strength. It is
usually applied only to weights, not to biases.</p>
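        <p>As a minimal sketch of the penalty described above, the following Python snippet adds a sum-of-squares term to an already computed data loss. The function names <monospace>l2_weight_penalty</monospace> and <monospace>regularised_loss</monospace>, the coefficient <monospace>lam</monospace>, and the example weights are illustrative assumptions and are not taken from the experiments; biases are assumed to be left out of the penalised list by the caller.</p>
        <preformat><![CDATA[
```python
def l2_weight_penalty(weights, lam):
    """Return lam times the sum of squared weights.

    Biases are assumed to be excluded from `weights` by the caller.
    """
    return lam * sum(w * w for w in weights)


def regularised_loss(data_loss, weights, lam):
    """Add the L2 weight penalty to an already computed data loss."""
    return data_loss + l2_weight_penalty(weights, lam)


weights = [0.5, -1.0, 2.0]  # hypothetical connection weights
penalty = l2_weight_penalty(weights, lam=0.01)
total = regularised_loss(0.3, weights, lam=0.01)
```
]]></preformat>
        <p>With a larger <monospace>lam</monospace>, the penalty dominates the data loss, which corresponds to the over-regularised regime discussed in the results.</p>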
        <p>
          <bold>Activity regularisation.</bold> The third type of regularisation
is activity regularisation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In short, it penalises large
output values of neurons. The effect may seem similar
to weight regularisation, but it may have more potential
when a layer is large. The
reason is that the weighted activities can add up to large
numbers even though the individual weights are small.
Activity regularisation is a way of making the input information
denser, a useful property that is commonly exploited
in autoencoder-type neural networks [
          <xref ref-type="bibr" rid="ref10 ref3">3, 10</xref>
          ].
        </p>
        <p>
          <bold>Dropout.</bold> The Dropout technique essentially mimics
bagging, which is regularly used to improve
generalisation by aggregating the results of multiple models
[
          <xref ref-type="bibr" rid="ref11 ref12 ref3">11, 12, 3</xref>
          ].
        </p>
        <p>When Dropout is applied to a specific layer, the training
and testing phases differ. In the training
phase, the outputs of the layer are randomly dropped, i.e. replaced with
zeros. The next layer is therefore forced to adapt to this
incomplete information. In the testing phase, the random
sampling is replaced by a multiplicative constant in order
to maintain the mean values of the activations for the next layer<sup>1</sup>.</p>
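        <p>The difference between the two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <monospace>dropout_forward</monospace>, the list-based representation of a layer's outputs, and the example values are hypothetical, and the multiplicative constant in the testing phase is taken to be the keep probability 1 - rate.</p>
        <preformat><![CDATA[
```python
import random


def dropout_forward(activations, rate, training, rng=None):
    """Apply Dropout to a list of activations.

    Training: each value is replaced by zero with probability `rate`.
    Testing: every value is multiplied by the keep probability (1 - rate),
    so the mean input to the next layer matches the training phase.
    """
    keep = 1.0 - rate
    if training:
        rng = rng or random.Random()
        return [a if rng.random() < keep else 0.0 for a in activations]
    return [a * keep for a in activations]


acts = [1.0, 2.0, 3.0, 4.0]  # hypothetical outputs of one layer
train_out = dropout_forward(acts, rate=0.5, training=True, rng=random.Random(0))
test_out = dropout_forward(acts, rate=0.5, training=False)
```
]]></preformat>
        <p>In the training phase each output is either kept unchanged or zeroed, whereas the testing phase is deterministic.</p>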
        <p>Consequently, Dropout increases the robustness of the model
without requiring any additional model to be trained. The main
difference compared with bagging is that the models in
Dropout are dependent: they share weights. This
sharing is illustrated in Figure 1.</p>
        <p>
          <bold>Alpha Dropout.</bold> Standard Dropout is suited for rectified
linear units because zero is the default value of this
activation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Alpha Dropout is a slight modification for
smoother activation functions. It preserves not only the
mean but also the variance. It is based on
maintaining a running average of the neurons’ outputs and scaling
them accordingly.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Loss functions</title>
        <p>
          A crucial part of any machine learning model selection is
the definition of a loss function (prediction error measure,
performance measure) [
          <xref ref-type="bibr" rid="ref13 ref2">13, 2</xref>
          ]. The loss function should
be fast to evaluate, convex, and matched to the random noise
present in the data. The Mean Absolute
Error (MAE) and Mean Square Error (MSE) functions are frequently
selected because they correspond to additive noise
generated by the Laplace and Gaussian distributions,
respectively. In addition, the Huber loss function (cf. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ])
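is commonly used. These three loss functions are defined as
          <disp-formula><label>(2)</label><tex-math><![CDATA[\mathrm{MSE}(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} (y_i - \hat{y}_i)^2 ,]]></tex-math></disp-formula>
          <disp-formula><label>(3)</label><tex-math><![CDATA[\mathrm{MAE}(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |y_i - \hat{y}_i| ,]]></tex-math></disp-formula>
          <disp-formula><label>(4)</label><tex-math><![CDATA[\mathrm{Huber}(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \begin{cases} (y_i - \hat{y}_i)^2 & \text{if } |y_i - \hat{y}_i| \le 1, \\ 2\,|y_i - \hat{y}_i| - 1 & \text{otherwise,} \end{cases}]]></tex-math></disp-formula>
where <italic>D</italic> is the dataset on which the loss is calculated, |<italic>D</italic>|
is its size, <italic>y</italic><sub>i</sub> is the target value of the <italic>i</italic>-th sample, and <italic>ŷ</italic><sub>i</sub> is
its prediction.
        </p>
        <sec id="sec-2-2-1">
          <p>[Figure: diagram of a network with an input layer (I1, I2, I3, …, In) and a hidden layer (H1, …, Hn).]</p>
        </sec>
        <sec id="sec-2-2-2">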
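          <p>The three losses defined above can be written as a short Python sketch. The function names and the example data are illustrative assumptions. In particular, the Huber variant below assumes a threshold of 1 and a scaling under which the loss equals the squared error for small residuals; other parametrisations of the Huber loss are common.</p>
          <preformat><![CDATA[
```python
def mse(y_true, y_pred):
    """Mean Square Error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred):
    """Huber loss, assuming threshold 1: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        r = abs(y - p)
        total += r * r if r <= 1.0 else 2.0 * r - 1.0
    return total / len(y_true)


# Hypothetical targets and predictions; the last sample acts as an outlier.
y_true = [0.0, 1.0, 2.0, 10.0]
y_pred = [0.0, 1.5, 2.0, 4.0]
losses = (mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```
]]></preformat>
          <p>On this toy data the single large residual dominates the MSE, while its contribution to the MAE and Huber losses grows only linearly.</p>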
          <p>It is common to assume that the dataset is outlier-free
and normally distributed; therefore the MSE is the first
choice regarding the selection of the loss function.</p>
          <p>The MAE and Huber losses are good replacements whenever the
data are known to contain outliers, or when the MSE has not
performed well for an unknown reason.</p>
          <p>
            <bold>Robust loss functions.</bold> A common way of dealing with
outliers is to remove them from the training data or to choose an
entirely different model<sup>2</sup> [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>Even though outlier removal has been thoroughly
studied, the exact definition of an outlier depends heavily
on the problem we want to solve.<fn><label>1</label><p>An alternative is to use the constant in the training phase.</p></fn><fn><label>2</label><p>Such as k-NN or a regression tree.</p></fn></p>
        </sec>
        <sec id="sec-2-2-3">
          <p>
            There exist definitions
of an outlier relying on the median absolute deviation [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ],
quantile and medoid [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], online Kalman filter [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] or
nearest neighbour based filtering [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ].
          </p>
          <p>
            A different approach is proposed in [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and improved
in [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], where the authors deal with robust linear regression by
removing the most prominent residuals from the loss
function. That idea was further adapted for neural networks
in [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] and for nonlinear regression with a known regression
function in [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. Essentially, these methods exploit the
idea that neural networks can learn algorithms
(hypotheses). Under the assumption that more complex algorithms
are harder to learn, a prior belief that reduces the
probability of more complex hypotheses also serves as an outlier
removal tool.
          </p>
          <p>The Least Trimmed Squares (LTS) and Least
Trimmed Absolute Deviations (LTA) extensions of MSE (2) and
MAE (3), which follow the approach recalled in the previous
paragraph and which we used in our analysis, are defined in
the following way:</p>
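          <p>
            <disp-formula><label>(5)</label><tex-math><![CDATA[\mathrm{LTS}(D) = \frac{1}{0.9\,|D|} \sum_{i=1}^{|D|} r\big((y_i - \hat{y}_i)^2\big) ,]]></tex-math></disp-formula>
            <disp-formula><label>(6)</label><tex-math><![CDATA[\mathrm{LTA}(D) = \frac{1}{0.9\,|D|} \sum_{i=1}^{|D|} r\big(|y_i - \hat{y}_i|\big) ,]]></tex-math></disp-formula>
where
            <disp-formula><tex-math><![CDATA[r(x_i) = \begin{cases} x_i & \text{if } x_i \text{ is among the smallest 90\,\% of the residuals,} \\ 0 & \text{otherwise.} \end{cases}]]></tex-math></disp-formula>
          </p>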
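          <p>A possible reading of the trimmed losses is sketched below in Python. The helper name <monospace>trimmed_mean_loss</monospace>, the rounding of the number of kept terms, and the example data are assumptions made for illustration; the sketch keeps the smallest 90 % of the per-sample terms and divides their sum by 0.9 times the dataset size.</p>
          <preformat><![CDATA[
```python
def trimmed_mean_loss(terms, keep_fraction=0.9):
    """Sum the smallest `keep_fraction` share of per-sample terms, divided by keep_fraction * n.

    The largest terms are treated as outliers and contribute zero.
    """
    n = len(terms)
    h = max(1, round(keep_fraction * n))  # number of kept terms (rounding is an assumption)
    kept = sorted(terms)[:h]
    return sum(kept) / (keep_fraction * n)


def lts(y_true, y_pred):
    """Least Trimmed Squares."""
    return trimmed_mean_loss([(y - p) ** 2 for y, p in zip(y_true, y_pred)])


def lta(y_true, y_pred):
    """Least Trimmed Absolute Deviations."""
    return trimmed_mean_loss([abs(y - p) for y, p in zip(y_true, y_pred)])


y_true = [0.0] * 10
y_pred = [1.0] * 9 + [100.0]  # the last prediction yields an outlying residual
lts_value = lts(y_true, y_pred)
lta_value = lta(y_true, y_pred)
```
]]></preformat>
          <p>In this toy example the outlying residual is trimmed away, so both losses are determined by the nine ordinary samples.</p>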
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Datasets and their preparation</title>
        <p>We selected 30 datasets containing a relatively small
number of samples. These are publicly available real-world as well as artificially
generated datasets for which fitting a
nonlinear regression model (i.e. explaining a given variable as a
response to predictors under uncertainty) is a
meaningful task. The list of the 30 datasets is presented in
Table 1. Only datasets without missing values were selected.
Ten-fold cross-validation was employed in order to
obtain more reliable results. If a dataset had fewer than ten
samples, we used leave-one-out cross-validation instead.
Each feature was standardised according to the training data
of the specific fold.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Aggregation of results</title>
        <p>The raw results of regression
methods cannot be visualized directly across multiple datasets and loss functions. For
example, some datasets are easier than others, and one loss
function emphasizes outliers more than another, so that most of the loss
may come from a single sample.</p>
        <p>We tackle this problem by placing the results for each combination of
dataset, fold, and loss function in a separate bin. Within each bin,
we order the results and build an empirical cumulative
distribution function (ECDF). Every result in a bin
is then mapped by the corresponding ECDF, which yields a
normalized rank of the result within that bin. Finally, all normalized results
are pooled again.</p>
        <p>To compare normalized results for a specific
hyperparameter, we split the pooled results by the value of that hyperparameter. These
splits create empirical distributions of normalized results,
which a violin plot can reasonably visualize.</p>
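        <p>The binning and ECDF mapping described above might be sketched as follows. The function name <monospace>ecdf_normalise</monospace>, the dictionary keys, and the example values are hypothetical; the ECDF is taken here as the share of results in a bin that are less than or equal to the given value.</p>
        <preformat><![CDATA[
```python
from bisect import bisect_right
from collections import defaultdict


def ecdf_normalise(results):
    """Map each result to its ECDF value within its (dataset, fold, loss) bin.

    `results` is a list of dicts with the illustrative keys
    'dataset', 'fold', 'loss_name' and 'value'.
    Returns normalised values aligned with the input order.
    """
    bins = defaultdict(list)
    for r in results:
        bins[(r["dataset"], r["fold"], r["loss_name"])].append(r["value"])
    sorted_bins = {key: sorted(vals) for key, vals in bins.items()}

    normalised = []
    for r in results:
        vals = sorted_bins[(r["dataset"], r["fold"], r["loss_name"])]
        # ECDF(x): share of results in the bin that are less than or equal to x
        normalised.append(bisect_right(vals, r["value"]) / len(vals))
    return normalised


# Two bins whose raw losses live on very different scales.
results = [
    {"dataset": "A", "fold": 0, "loss_name": "L2", "value": 100.0},
    {"dataset": "A", "fold": 0, "loss_name": "L2", "value": 300.0},
    {"dataset": "B", "fold": 0, "loss_name": "L2", "value": 0.01},
    {"dataset": "B", "fold": 0, "loss_name": "L2", "value": 0.03},
]
scores = ecdf_normalise(results)
```
]]></preformat>
        <p>After the mapping, the better result in each bin receives the same normalised value regardless of the scale of the raw losses, so results from different bins can be pooled.</p>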
        <table-wrap>
          <table>
            <thead>
              <tr><th>Hyperparameter</th><th>Considered values</th></tr>
            </thead>
            <tbody>
              <tr><td>Loss</td><td>MSE, MAE, Huber</td></tr>
              <tr><td>Size of the first layer</td><td>4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192</td></tr>
              <tr><td>Model regularisation</td><td>L2-weight: 0.001, 0.01, 0.1, 1, 10; L2-activity: 0.001, 0.01, 0.1, 1, 10; Alpha dropout: 0.1, 0.2, 0.4, 0.6, 0.8</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-2-3">
          <p>We have used only non-parametric blocking statistical
tests because the results have a limited range of values, and we
wanted to utilise as much information as possible. The
Friedman test was used to decide whether the results for the individual
values of a hyperparameter are drawn from the same
distribution or not. At this point, the ECDF mapping is
not needed because the test is non-parametric. All
statistical tests use the usual 5% significance level.</p>
          <p>
            Multiple comparisons were carried out using the Wilcoxon
signed-rank test [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] with the Holm correction [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] instead of
mean-ranks post-hoc tests which can create
inconsistencies and paradoxical situations in machine learning
scenarios [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ].
          </p>
          <p>We selected a three-layer architecture. The first layer has
<italic>T</italic> neurons, the second layer always has <italic>T</italic>/2 neurons, and the third
layer always has one neuron. The first two layers use
Scaled Exponential Linear Units (SELU) as the activation
function, and the third layer is linear.
          </p>
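          <p>To give a feeling for how the number of free parameters grows with <italic>T</italic>, the following sketch counts the weights and biases of a fully connected network with layer sizes <italic>T</italic>, <italic>T</italic>/2 and 1, and evaluates the SELU activation. Fully connected layers with biases, the helper names, and the input dimension of 10 are assumptions made for illustration; the SELU constants are the commonly published ones.</p>
          <preformat><![CDATA[
```python
import math

# Commonly published SELU constants (self-normalizing networks)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled Exponential Linear Unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def count_parameters(n_inputs, t):
    """Weights and biases of dense layers with t, t // 2 and 1 neurons."""
    first = (n_inputs + 1) * t        # inputs plus bias into the first layer
    second = (t + 1) * (t // 2)       # first layer plus bias into the second layer
    output = (t // 2 + 1) * 1         # second layer plus bias into the output neuron
    return first + second + output


# Hypothetical input dimension of 10 features
n_params_small = count_parameters(n_inputs=10, t=16)
n_params_large = count_parameters(n_inputs=10, t=128)
```
]]></preformat>
          <p>Because the second term grows quadratically in <italic>T</italic>, the larger first-layer sizes quickly exceed the number of samples in a small dataset.</p>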
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Training</title>
        <p>We trained our models with the NAdam optimiser with a
learning rate of 0.001. We use early stopping with patience =
10 and delta = 1e-10 to speed up the training. Even though
this is another type of regularisation, we use it in such a
manner that its effect is minuscule. The maximal number
of epochs is set to 10000, and the batch size is equal to the
size of the largest dataset. In the first set of experiments,
we produced 48 014 neural networks and their results. In
the second set, we prepared 178 092 models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Dropout regularisation</title>
        <p>In the first experiment, we compared three
settings: no regularisation, Dropout regularisation, and Alpha
Dropout regularisation. Both Dropout techniques were set
to a 50% dropout probability. The results are shown in Figure 2; the number of
models that are better than their counterpart with the same hyperparameters
can be seen in Table 2b for the L1 loss and in Table 2c
for the L2 loss. Values that the Wilcoxon
signed-rank test with Holm correction found significantly
better than the column value are highlighted in bold.</p>
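        <p>The Holm correction mentioned above is a step-down procedure that can be sketched in a few lines. The function name <monospace>holm_reject</monospace> and the p-values are hypothetical; in practice the p-values would come from the pairwise Wilcoxon signed-rank tests, which are not reproduced here.</p>
        <preformat><![CDATA[
```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: return a list of booleans (True = null hypothesis rejected)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the p-value at position `rank` (0-based, ascending) with alpha / (m - rank)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are retained as well
    return reject


p_values = [0.04, 0.001, 0.03]  # hypothetical p-values of three pairwise tests
decisions = holm_reject(p_values)
```
]]></preformat>
        <p>Only the smallest p-value survives the correction in this example, although all three are below the uncorrected 5% level.</p>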
        <p>It seems that the regularisation does not help. This may
be caused by an excessively high Dropout rate
or by the need for such models to have wider layers. We do not
know why Alpha Dropout performed so badly;
it should perform better because we used SELU as the activation
function.</p>
        <p>One possible explanation for this poor performance is
the use of early stopping. In the training phase, Dropout
makes the output stochastic, so the error is
stochastic too. The stochastic error can fluctuate by chance,
which can stop the training prematurely. Because each epoch consists of
a single mini-batch, the variance is too large to go
unnoticed.</p>
        <p>Though not tried in our experiments, a possible remedy
could be a variant of early stopping in which the error is
exponentially smoothed.</p>
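        <p>One way such a smoothed early stopping could look is sketched below. This variant was not part of the experiments; the function name, the smoothing factor <monospace>beta</monospace>, and the toy loss curve are assumptions. Setting <monospace>beta</monospace> to 0 recovers early stopping on the raw error.</p>
        <preformat><![CDATA[
```python
def smoothed_early_stopping(losses, patience=10, min_delta=1e-10, beta=0.9):
    """Return the epoch index at which training would stop.

    The monitored value is an exponential moving average of the loss,
    ema_t = beta * ema_{t-1} + (1 - beta) * loss_t, instead of the raw loss.
    """
    ema = None
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses):
        ema = loss if ema is None else beta * ema + (1.0 - beta) * loss
        if ema < best - min_delta:
            best = ema
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses) - 1  # never triggered: training runs to the last epoch


# Toy loss curve with a noisy plateau before a later improvement
losses = [1.0, 0.9, 0.95, 0.93, 0.92, 0.5, 0.4]
raw_stop = smoothed_early_stopping(losses, patience=3, beta=0.0)
smooth_stop = smoothed_early_stopping(losses, patience=3, beta=0.5)
```
]]></preformat>
        <p>On the toy curve, the raw criterion stops on the noisy plateau, whereas the smoothed criterion continues until the last epoch.</p>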
      </sec>
      <sec id="sec-4-2">
        <title>Size of models</title>
        <p>In the second and third experiments, we were interested
in the network size and its effect on performance. Figure 3a
and Table 4 show non-regularised models, and Figure 3b
and Table 5 show the Dropout variants combined together.
The non-regularised results are better than those of the Dropout
variants, which are less stable and respond to
an increase in network size with a delay.</p>
        <p>The instability may come from the same source as the
previous problem: early stopping could leave the model
undertrained. The delay may be the result of the selected
dropout rate. Because we used a dropout rate of 50%,
the amount of usable information can be effectively
halved in each hidden layer (given that there is no space
or resources to make the information denser). Over the two hidden layers, this
amounts to a 4x delay, which is not enough to explain the findings
(the optimum size of the model is 2<sup>4</sup> vs 2<sup>7</sup>). Other possible
reasons could be
<list list-type="bullet">
<list-item><p>the difficulty of encoding uncertain patterns,</p></list-item>
<list-item><p>undertraining due to early stopping.</p></list-item>
</list></p>
      </sec>
      <sec id="sec-4-3">
        <title>Loss function</title>
        <p>In the fourth experiment, we analyzed the effect of the
loss function for models without regularisation. The trimmed
variants performed poorly, probably because they remove
some residuals (10%) and therefore reduce the dataset size
even more. In our case, the Mean Squared Error (MSE)
performs better than the Mean Absolute Error (MAE). From the
distribution in Figure 4 it seems that MSE has much worse
results, but the median value (shown as the white point in
the central part of the graph) of MSE is better than that of
MAE. The best loss function is the Huber loss. All results
can be seen in Table 6.</p>
        <p>The Huber loss combines the benefits of both worlds
because its derivative depends on the size of the error
(as in MSE) while the magnitude of the derivative is bounded (as in
MAE). This effect may be responsible for the best result
among the considered loss functions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Weight regularisation</title>
        <p>Weight regularisation has a prominent effect on the
results, as revealed in Figure 5. Too strong a penalty is certainly worse
than no weight regularisation, but suitable values
significantly reduce the number of bad results.</p>
        <p>[Figure panel (a): Distributions of scaled results by empirical cumulative distribution functions for each dataset and loss separately. Compared settings: No reg., Dropout, Alpha Dropout.]</p>
        <p>[Table caption fragment: If the Wilcoxon signed-rank test with Holm correction rejected the null hypothesis (column is better than row), the value is highlighted in bold. The statistical tests clearly prefer models without dropout.]</p>
        <p>[Figure 3: (a) Distributions of test losses for non-regularized models. (b) Distributions of test losses for Dropout and Alpha Dropout models combined together.]</p>
        <p>[Table values, layout not recoverable: 1237, 1477, 980, 907, 1750, 966, 920, 1085, 1006.]</p>
        <p>If the regularisation is too strong, the loss is effectively
replaced by the term that reduces the weights on the
network’s connections. If it is too weak, the network lacks
regularisation, which can lead to volatile responses.</p>
        <p>In our case, the effect of activity regularisation is similar to,
but smaller than, that of the weight penalty. The difference in
effectiveness between weight and activity regularisation can be
explained by the specific activation function used in training. The
results are in Figure 6 and in Table 8.</p>
        <p>The effects of the Alpha Dropout rate
can be seen in Figure 7 and Table 9. It may be worthwhile to investigate smaller values,
because the 0.1 rate is the best. The preference for
not having this regularisation can be explained in the same way as
in subsection 4.1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper, we analyzed several types of regularisation
techniques on datasets where effective hyperparameter
optimization is not possible due to the lack of samples
or the presence of outliers. We showed
that Dropout techniques are not a good
choice in these scenarios because their results are not stable enough to
compete with models without regularisation. The model’s size
is an essential aspect, and it seems that the optimum has a
far larger number of free parameters than the theoretical
number computed using the average across our training
datasets. The Huber loss function performed best because it does
not suffer from the inconsistencies of the MAE or MSE losses.
Trimmed variants of loss functions [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] performed poorly
here, but they may be better if a particular dataset has more
samples than we had. The third most important hyperparameter to
tune is weight regularisation: a small weight penalty
dramatically reduces the frequency of bad results while
keeping the median of results low.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>The research reported in this paper has been supported
by SVV project number 260 575 and partially supported
by the Czech Science Foundation (GA ČR) projects
18-18080S and 19-05704S.</p>
      <p>Computational resources were supplied by the project
"e-Infrastruktura CZ" (e-INFRA LM2018140) provided
within the program Projects of Large Research,
Development and Innovations Infrastructures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Lawrence</surname>
          </string-name>
          , Clyde Lee Giles, and Ah Chung Tsoi.
          <article-title>What size neural network gives optimal generalization? Convergence properties of backpropagation</article-title>
          .
          <source>Technical report</source>
          , Institute for Advanced Computer Studies University of Maryla,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          , Robert Tibshirani, and
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>The elements of statistical learning: data mining, inference and prediction</article-title>
          . Springer, 2nd edition,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <source>Deep Learning</source>
          . MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Radford M.</given-names>
            <surname>Neal</surname>
          </string-name>
          .
          <article-title>Priors for infinite networks</article-title>
          .
          <source>In Bayesian Learning for Neural Networks</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>53</lpage>
          . Springer,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Christopher K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Computing with infinite networks</article-title>
          .
          <source>Advances in neural information processing systems</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>301</lpage>
          ,
          <year>1997</year>
          . Morgan Kaufmann Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tamir</given-names>
            <surname>Hazan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tommi</given-names>
            <surname>Jaakkola</surname>
          </string-name>
          .
          <article-title>Steps toward deep kernel methods from infinite neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1508.05133</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jaehoon</given-names>
            <surname>Lee</surname>
          </string-name>
          , Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein.
          <source>Deep Neural Networks as Gaussian Processes</source>
          ,
          <year>2017</year>
          . arXiv:1711.00165.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Anselm</given-names>
            <surname>Blumer</surname>
          </string-name>
          , Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth
          .
          <article-title>Occam's razor</article-title>
          .
          <source>Information Processing Letters</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <fpage>377</fpage>
          -
          <lpage>380</lpage>
          ,
          <year>1987</year>
          . Publisher: Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Carl Edward</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoubin</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Occam's razor</article-title>
          .
          <source>Advances in neural information processing systems</source>
          , pages
          <fpage>294</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2001</year>
          . Publisher: MIT;
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Brownlee</surname>
          </string-name>
          .
          <source>Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions. Machine Learning Mastery</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Günter</given-names>
            <surname>Klambauer</surname>
          </string-name>
          , Thomas Unterthiner, Andreas Mayr, and
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          .
          <article-title>Self-Normalizing Neural Networks</article-title>
          .
          <source>arXiv:1706.02515 [cs, stat]</source>
          , September
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Dropout: A Simple Way to Prevent Neural Networks from Overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>15</volume>
          :
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          ,
          <year>June 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Scholkopf</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander J.</given-names>
            <surname>Smola</surname>
          </string-name>
          .
          <article-title>Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond</article-title>
          . MIT Press, Cambridge, MA, USA,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Huber</surname>
          </string-name>
          . Robust statistics. Wiley, New York, 2nd edition,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Frank R.</given-names>
            <surname>Hampel</surname>
          </string-name>
          .
          <article-title>The influence curve and its role in robust estimation</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>69</volume>
          (
          <issue>346</issue>
          ):
          <fpage>383</fpage>
          -
          <lpage>393</lpage>
          ,
          <year>1974</year>
          . Publisher: Taylor &amp; Francis.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Irad</given-names>
            <surname>Ben-Gal</surname>
          </string-name>
          .
          <article-title>Outlier detection</article-title>
          .
          In
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>146</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Hancong</given-names>
            <surname>Liu</surname>
          </string-name>
          , Sirish Shah, and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>On-line outlier detection and data cleaning</article-title>
          .
          <source>Computers &amp; Chemical Engineering</source>
          ,
          <volume>28</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1635</fpage>
          -
          <lpage>1647</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Sridhar</given-names>
            <surname>Ramaswamy</surname>
          </string-name>
          , Rajeev Rastogi, and
          <string-name>
            <given-names>Kyuseok</given-names>
            <surname>Shim</surname>
          </string-name>
          .
          <article-title>Efficient Algorithms for Mining Outliers from Large Data Sets</article-title>
          .
          In
          <source>Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00</source>
          , pages
          <fpage>427</fpage>
          -
          <lpage>438</lpage>
          , New York, NY, USA,
          <year>2000</year>
          .
          <publisher-name>Association for Computing Machinery</publisher-name>
          . event-place: Dallas, Texas, USA.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          .
          <article-title>Least median of squares regression</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>79</volume>
          (
          <issue>388</issue>
          ):
          <fpage>871</fpage>
          -
          <lpage>880</lpage>
          ,
          <year>1984</year>
          . Publisher: Taylor &amp; Francis.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          and
          <string-name>
            <given-names>Katrien</given-names>
            <surname>Van Driessen</surname>
          </string-name>
          .
          <article-title>Computing LTS regression for large data sets</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>29</fpage>
          -
          <lpage>45</lpage>
          ,
          <year>2006</year>
          . Publisher: Springer.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalina</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vidnerová</surname>
          </string-name>
          .
          <article-title>Robust Multilayer Perceptrons: Robust Loss Functions and Their Derivatives</article-title>
          . In L. Iliadis,
          <string-name>
            <given-names>P.</given-names>
            <surname>Angelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jayne</surname>
          </string-name>
          , and E. Pimenidis, editors,
          <source>Proceedings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference</source>
          , pages
          <fpage>546</fpage>
          -
          <lpage>557</lpage>
          , Cham,
          <year>2020</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neoral</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vidnerová</surname>
          </string-name>
          .
          <article-title>Effective automatic method selection for nonlinear regression modelling</article-title>
          .
          <source>International Journal of Neural Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Frank</given-names>
            <surname>Wilcoxon</surname>
          </string-name>
          .
          <article-title>Individual Comparisons by Ranking Methods</article-title>
          .
          <source>Biometrics Bulletin</source>
          ,
          <volume>1</volume>
          (
          <issue>6</issue>
          ):
          <fpage>80</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>1945</year>
          . Publisher: JSTOR.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Sture</given-names>
            <surname>Holm</surname>
          </string-name>
          .
          <article-title>A simple sequentially rejective multiple test procedure</article-title>
          .
          <source>Scandinavian Journal of Statistics</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          ,
          <year>1979</year>
          . Publisher: JSTOR.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Alessio</given-names>
            <surname>Benavoli</surname>
          </string-name>
          , Giorgio Corani, and
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Mangili</surname>
          </string-name>
          .
          <article-title>Should We Really Use Post-Hoc Tests Based on Mean-Ranks?</article-title>
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>17</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>