A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets

Jiří Tumpach 1,2, Jan Kalina 2, and Martin Holeňa 2
1 Charles University, Faculty of Mathematics and Physics, Prague
2 The Czech Academy of Sciences, Institute of Computer Science, Prague
{tumpach,kalina,martin}@cs.cas.cz

Abstract: Neural networks are frequently used as regression models. Their training is usually difficult when the model is subject to a small training dataset with numerous outliers.
This paper investigates the effects of various regularisation techniques that can help with this kind of problem. We analysed the effects of the model size, loss selection, L2 weight regularisation, L2 activity regularisation, Dropout, and Alpha Dropout.
We collected 30 different datasets, each of which has been split by ten-fold cross-validation. As an evaluation metric, we used cumulative distribution functions (CDFs) of L1 and L2 losses to aggregate results from different datasets without a considerable amount of distortion. Distributions of the metrics are shown, and thorough statistical tests were conducted.
Surprisingly, the results show that Dropout models are not suited for our objective. The most effective approach is the choice of model size and L2 types of regularisations.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural networks are nature-inspired regression models increasingly important in machine learning. This type of model excels in predictive power, but it has a poor robustness to outliers if the training dataset has a small number of samples, the target function is complicated, or the network is over-parametrized for the problem [1, 2, 3].

On the other hand, novel theoretical analyses show a different perspective on neural networks. In [4, 5] the authors investigate the effect of priors over weights for an infinitely wide single-layer neural network and show that a Gaussian prior results in a Gaussian process prior over its functions. The Gaussian process is a smooth non-parametric model well known for its generalisation properties, so it leads to the conjecture that there is no need to avoid over-fitting of such a network. That idea was further generalised to two-layer neural networks in [6] and to general deep neural networks in [7]. Experiments in [7] show that finite-width neural networks approach their infinite counterparts as their width increases. The authors of [7] further pointed out that Dropout could be an interesting potential improvement.

In this paper we are interested in these areas where the network should struggle, because we intend to use neural networks as approximations for surrogate modeling in black-box optimisation. Surrogate models are local models that estimate an unknown function in order to select better candidates for evaluation and in this way reduce the cost of optimisation.

That motivated the research reported in this paper, in which we have investigated 5 different regularisation techniques in different configurations on 30 different datasets.
2 Methods

2.1 Regularisation

Regularisation is a broad term used for methods that add some new prior belief [2, 3] to a specific machine learning method. The belief should redefine the problem to achieve a solution modified in the sense described by Occam's razor principle [8, 9] – more complex hypotheses are less likely than simple hypotheses. For example, the lasso/ridge shrinkage methods in linear regression add a new term that makes small values more suitable as the solution to a problem [2].

The network size regularisation
As with many other regression models, the number of free parameters has critical consequences [3]. Models with a small number of free parameters can handle only simple relationships, while large models can be more flexible. On the other hand, a large model needs more samples in order to achieve more reliable predictions for all parameters. If that requirement is not met, the regression can over-fit – the model finds some non-sensible but possible relationships in the training dataset, which are not valid in general.

Weight regularisation
One possible solution to the free parameters problem is a restriction of parameter domains. In neural networks, it is done using weight regularisation. In fact, the domains of the parameters are unchanged, but the probability of larger values is strongly reduced because of an alteration of the optimisation objective. For example, the L2 type of weight regularisation adds a new term L_{w2} to the loss of a particular network. It is defined as

    L_{w2} = \sum_i w_i^2,    (1)

where w_i stands for the value of the i-th parameter. It is usually applied only to weights, not to biases.

Activity regularisation
The third type of regularisation is the activity regularisation [10]. In short, it penalises big values coming from neurons. The effect may seem similar to weight regularisation, but it may have more potential in cases where the size of a layer is large enough. The reason is that the weighted activities could add up to large numbers in spite of small values of the weights. Activity regularisation is a way of making the input information denser, which is a nice property that is commonly utilised in autoencoder-type neural networks [3, 10].

Dropout
The Dropout technique essentially mimics the bagging technique, which is regularly used for improving the generalisation of multiple models by aggregating their results [11, 12, 3].

When Dropout is applied to a specific layer, the training and testing phases differ. When the model is in the training stage, the outputs are randomly dropped – replaced with zeros. Therefore the next layer is forced to adapt to this incomplete information. In the testing phase, the random sampling is replaced by a multiplicative constant in order to maintain the mean values of activation for the next layer (an alternative is to use the constant in the training phase).

Consequently, it increases the robustness of the model and does not require any other model to train. The main difference compared with bagging is that the models in Dropout are dependent – they share weights. Such a sharing is illustrated in Figure 1.

Figure 1 (diagram omitted): Dropout regularisation. The red circle depicts the neuron where Dropout regularisation causes the output to be masked – in the current iteration, the output is set to zero, and all following layers are computed as normal. This effect causes the updates of immediate incoming connections of that neuron to be zero, but other updates can still modify all other previous weights through non-dropped neurons. In this case, all red edges represent changes brought by gradients from other neurons that may influence the dropped neuron in the following iterations.

Alpha Dropout
Standard Dropout is suited for rectified linear units because zero is the default value of this activation [11]. Alpha Dropout is a slight modification for smoother activation functions. It deals not only with the mean, but also with the variance. It is based on maintaining a running average of neurons' outputs and scaling them accordingly.
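As an illustration of how these regularisers are typically configured, the following minimal sketch (ours, not the authors' code) attaches all four of them to one dense layer, assuming the tf.keras API; the layer width, penalty weights and dropout rate are placeholders only.

    # A minimal sketch (ours), assuming the tf.keras API: how the regularisers
    # discussed above are typically attached to a single dense layer. The
    # width, penalty weights and dropout rate are illustrative placeholders.
    from tensorflow.keras import layers, regularizers

    def regularised_block(units=64, l2_weight=0.01, l2_activity=0.01,
                          drop_rate=0.5, use_alpha_dropout=False):
        dense = layers.Dense(
            units,
            activation="selu",
            # Eq. (1): adds l2_weight * sum_i(w_i^2) to the loss; biases excluded.
            kernel_regularizer=regularizers.l2(l2_weight),
            # Activity regularisation penalises large neuron outputs instead.
            activity_regularizer=regularizers.l2(l2_activity),
        )
        if use_alpha_dropout:
            # Alpha Dropout preserves mean and variance of SELU activations.
            drop = layers.AlphaDropout(drop_rate)
        else:
            # Standard Dropout replaces outputs with zeros during training only.
            drop = layers.Dropout(drop_rate)
        return [dense, drop]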
2.2 Loss functions

A crucial part of any machine learning model selection is the definition of a loss function (prediction error measure, performance measure) [13, 2]. The loss function should be fast, convex, and should match the random noise that can be found in the data. Frequently, the Mean Absolute Error (MAE) and Mean Square Error (MSE) functions are selected because the corresponding noise is additive and generated by the Laplacian and Gaussian distribution, respectively. In addition, also the Huber loss function (cf. [14]) is commonly used. These three loss functions are defined as:

    MSE(D) = \frac{1}{2|D|} \sum_{i=1}^{|D|} (y_i - \hat{y}_i)^2,    (2)

    MAE(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |y_i - \hat{y}_i|,    (3)

    Huber(D) = \frac{1}{2|D|} \sum_{i=1}^{|D|} \min\big( (y_i - \hat{y}_i)^2, \; 2|y_i - \hat{y}_i| - 1 \big),    (4)

where D is the dataset on which the loss is calculated, |D| is its size, y_i is the target value of the i-th sample, and \hat{y}_i is its prediction.

It is common to assume that the dataset is outlier-free and normally distributed; therefore the MSE is the first choice regarding the selection of the loss function. MAE/Huber losses are good replacements whenever the data are known to have outliers, or the MSE has not performed well for an unknown reason.
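For concreteness, the following minimal NumPy sketch (ours, not the authors' code) evaluates the three losses exactly as written in Eqs. (2)-(4), including the 1/(2|D|) scaling used above; y and y_hat are one-dimensional arrays of targets and predictions.

    # A minimal NumPy sketch (ours) of the losses in Eqs. (2)-(4).
    import numpy as np

    def mse(y, y_hat):
        # Eq. (2): (1 / (2|D|)) * sum_i (y_i - y_hat_i)^2
        return np.sum((y - y_hat) ** 2) / (2 * len(y))

    def mae(y, y_hat):
        # Eq. (3): (1 / |D|) * sum_i |y_i - y_hat_i|
        return np.mean(np.abs(y - y_hat))

    def huber(y, y_hat):
        # Eq. (4): quadratic for small residuals, linear for large ones.
        r = np.abs(y - y_hat)
        return np.sum(np.minimum(r ** 2, 2.0 * r - 1.0)) / (2 * len(y))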
Robust loss functions
A common way of dealing with outliers is to remove them from the training data or to choose an entirely different model, such as k-NN or a regression tree [2].

Even though outlier removal has been thoroughly studied, the exact definition of an outlier highly depends on the problem we want to solve. There exist definitions of an outlier relying on the median absolute deviation [15], quantiles and medoids [16], an online Kalman filter [17], or nearest neighbour based filtering [18].

A different approach is proposed in [19] and improved in [20], where the authors deal with robust linear regression by removing the most prominent residuals in the loss function. That idea was further adapted for neural networks in [21] and for nonlinear regression with a known regression function in [22]. Essentially, these methods exploit the idea that neural networks can learn algorithms (hypotheses). With an assumption that more complex algorithms are harder to learn, the prior belief that reduces the probability of more complex hypotheses also serves as an outlier removal tool.

Extensions
Least Trimmed Squares (LTS) and Least Trimmed Absolute Deviations (LTA) are extensions of the MSE (2) and MAE (3) that follow the approach recalled in the previous paragraph. We used them in our analysis; they are defined in the following way:

    LTS(D) = \frac{1}{0.9|D|} \sum_{i=1}^{|D|} \rho\big( (y_i - \hat{y}_i)^2 \big),    (5)

    LTA(D) = \frac{1}{0.9|D|} \sum_{i=1}^{|D|} \rho\big( |y_i - \hat{y}_i| \big),    (6)

where \rho(x_i) = x_i if x_i belongs to the 90 % smallest residuals, and \rho(x_i) = 0 otherwise.
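A minimal NumPy sketch (ours, not the authors' code) of the trimmed losses in Eqs. (5) and (6) follows: rho zeroes out the residuals above the 90 % quantile before the sum is divided by 0.9|D|; the keep ratio appears as a parameter only for illustration.

    # A minimal NumPy sketch (ours) of the trimmed losses in Eqs. (5) and (6).
    import numpy as np

    def _rho(residuals, keep=0.9):
        # Keep a residual only if it belongs to the `keep` smallest ones.
        cutoff = np.quantile(residuals, keep)
        return np.where(residuals <= cutoff, residuals, 0.0)

    def lts(y, y_hat, keep=0.9):
        # Eq. (5): least trimmed squares.
        return np.sum(_rho((y - y_hat) ** 2, keep)) / (keep * len(y))

    def lta(y, y_hat, keep=0.9):
        # Eq. (6): least trimmed absolute deviations.
        return np.sum(_rho(np.abs(y - y_hat), keep)) / (keep * len(y))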
3 Methodology

3.1 Datasets and their preparation

We selected 30 datasets containing a relatively small number of samples. These are real-world as well as artificially generated publicly available datasets, for which a nonlinear regression model (i.e. explaining a given variable as a response against predictors under uncertainty) is a meaningful task. The list of the 30 datasets is presented in Table 1. Only datasets without missing values were selected.

A ten-fold cross-validation has been employed in order to obtain more reliable results. If a dataset had fewer than ten samples, we used leave-one-out cross-validation instead. Each feature was standardised according to the training data in a specific fold.

Table 1: All datasets considered in the analysis.

    Name                             no. features   no. samples
    Concrete Compressive Strength    9              1030
    The Boston Housing               14             506
    Auto MPG                         8              398
    Proben1 (3d reg.)                6              4208
    Misra1a                          2              14
    Chwirut2                         3              54
    Chwirut1                         3              214
    Lanczos3                         6              24
    Gauss1                           8              250
    Gauss2                           8              250
    DanWood                          2              6
    Kirby2                           5              151
    Hahn1                            7              236
    Nelson                           3              128
    MGH17                            5              33
    Lanczos1                         6              24
    Lanczos2                         6              24
    Gauss3                           8              250
    Roszman1                         4              25
    ENSO                             9              168
    MGH09                            4              11
    Thurber                          7              37
    BoxBOD                           2              6
    Rat42                            3              9
    MGH10                            3              16
    Eckerle4                         3              35
    Rat43                            4              15
    Bennett5                         3              154

Table 2: Options for the first set of experiments.

    Hyperparameter            Considered values
    Loss                      MSE, MAE, Huber, LTS, LTA
    Size of the first layer   4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
    Model regularisation      No regularisation, 50% Dropout, 50% Alpha Dropout

Table 3: Options for the second set of experiments.

    Hyperparameter            Considered values
    Loss                      MSE, MAE, Huber
    Size of the first layer   4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
    Model regularisation      L2-weight – 0.001, 0.01, 0.1, 1, 10;
                              L2-activity – 0.001, 0.01, 0.1, 1, 10;
                              Alpha dropout – 0.1, 0.2, 0.4, 0.6, 0.8

3.2 Aggregation of results

It is not possible to directly visualize the results of regression methods across multiple datasets and loss functions. For example, some datasets are easier than others, and one loss function highlights outliers more, so most of the loss is made of one sample.

We tackle this problem by placing each specific dataset, fold, and loss function in a separate bin. In this bin, we learn the order of the results, creating an empirical cumulative distribution function (ECDF). Every result in a specific bin was mapped by the corresponding ECDF, creating a normalized order of results in that particular bin. Finally, all results are combined back together.

To compare normalized results for a specific hyperparameter, we split the combined results by its value. These splits create empirical distributions of normalized results, which a violin plot can reasonably visualize.
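The following pandas sketch (ours, not the authors' code) illustrates this aggregation: within every (dataset, fold, loss) bin, the raw losses are mapped through the bin's ECDF, so each result becomes a normalised rank before the bins are merged. The column names are hypothetical.

    # A minimal pandas sketch (ours) of the aggregation described in Section 3.2.
    import pandas as pd

    def ecdf_normalise(df: pd.DataFrame) -> pd.DataFrame:
        """df columns (hypothetical): dataset, fold, loss_name, configuration, raw_loss."""
        def map_bin(bin_df):
            bin_df = bin_df.copy()
            # ECDF value of each raw loss within its own bin.
            bin_df["normalised"] = (bin_df["raw_loss"].rank(method="average")
                                    / len(bin_df))
            return bin_df
        return (df.groupby(["dataset", "fold", "loss_name"], group_keys=False)
                  .apply(map_bin))

    # Splitting the "normalised" column by a hyperparameter value then yields
    # the empirical distributions visualised as violin plots in the figures.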
3.3 Statistical tests

We have used only non-parametric blocking statistical tests because the results have limited values, and we wanted to utilise as much information as possible. The Friedman test was used to decide whether the results included in a particular view on some hyperparameter are drawn from the same distribution or not. At this point, the ECDF mapping is not needed because the test is non-parametric. All statistical tests use the usual 5% significance level.

Multiple comparison tests were done using the Wilcoxon signed-rank test [23] with Holm correction [24] instead of mean-ranks post-hoc tests, which can create inconsistencies and paradoxical situations in machine learning scenarios [25].

3.4 Architecture

We selected a three-layered architecture. The first layer has T neurons, the second layer always has T/2, and the third layer always has one neuron. The first two layers have Scaled Exponential Linear Units (SELU) as the activation function, and the third layer has a linear activation function.

3.5 Training

We trained our models with the NAdam optimiser with a 0.001 learning rate. We use early stopping with patience = 10 and delta = 1e-10 to speed up the training. Even though this is another type of regularisation, we use it in such a manner that its effect is minuscule. The maximal number of epochs is set to 10000, and the batch size is equivalent to the size of the largest dataset. In the first set of experiments, we produced 48 014 neural networks and their results. In the second set, we managed to prepare 178 092 models.
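The following sketch (ours, not the released experiment code) puts Sections 3.4 and 3.5 together, assuming the tf.keras API; the loss argument would be one of the functions from Section 2.2, and the batch size here is simply the training set size as a stand-in for the largest dataset.

    # A minimal sketch (ours) of the architecture and training setup, assuming tf.keras.
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(n_features, T=64):
        # Three layers: T and T/2 SELU units, then a single linear output.
        return tf.keras.Sequential([
            layers.Dense(T, activation="selu", input_shape=(n_features,)),
            layers.Dense(T // 2, activation="selu"),
            layers.Dense(1, activation="linear"),
        ])

    def train(model, x_train, y_train, x_val, y_val, loss="mse"):
        model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001),
                      loss=loss)
        early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                                      min_delta=1e-10)
        # The paper uses a batch size equal to the size of the largest dataset,
        # so the training effectively runs in full-batch mode.
        model.fit(x_train, y_train, validation_data=(x_val, y_val),
                  epochs=10000, batch_size=len(x_train),
                  callbacks=[early_stop], verbose=0)
        return model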
4 Results

4.1 Dropout regularisation

In the first experiment, we compared three settings – no regularisation, Dropout regularisation, and Alpha Dropout regularisation. Both Dropout techniques are set to a 50% probability. The results are in Figure 2; the number of models that are better than the same hyperparameter counterpart can be seen in Figure 2 (b) for the L1 loss and Figure 2 (c) for the L2 loss. We highlighted in bold the values that the Wilcoxon signed-rank test with Holm correction found significantly better than the column value.

It seems that the regularisation does not help. It may be caused by the exaggerated value of the Dropout rate or a need for such models to have wider layers. We do not know the reason why Alpha Dropout performed so badly – it should be better because we used SELU as the activation function.

One possible explanation for this poor performance is the use of early stopping. In the training phase, the dropout causes the output to be stochastic, so the error is stochastic too. The stochastic error can cause accidental results, which can stop the training prematurely. Because we have one mini-batch, the variance is too significant not to be perceptible.

Though not tried in our experiments, a possible remedy could be an early stopping variation where the error is exponentially smoothed.

Figure 2 (violin plots omitted): The first experiment compared different kinds of dropout regularizers across all different combinations of datasets and hyperparameters. Panel (a): distributions of scaled results by empirical cumulative distribution functions for each dataset and loss separately. Panels (b) and (c), reproduced below, display the number of experiments where one regularisation setting (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold. The statistical tests clearly prefer models without dropout.

    (b) Statistical tests for the L1 loss function
                     No reg.   Dropout   Alpha Dropout
    No reg.          -         10898     14070
    Dropout          4557      -         12360
    Alpha Dropout    1385      3095      -

    (c) Statistical tests for the L2 loss function
                     No reg.   Dropout   Alpha Dropout
    No reg.          -         10904     14020
    Dropout          4551      -         12352
    Alpha Dropout    1435      3103      -

4.2 Size of models

In the second and third experiments, we were interested in the network size and its effect on performance. Figure 3a and Table 4 show non-regularised models, and Figure 3b and Table 5 show the Dropout variants combined together. Non-regularized results are better than the Dropout variants, which are less stable and have a delayed response to the increase of network size.

The lower stability may come from the same source as the previous problem – the early stopping could make the model undertrained. The delay may be the result of the selected dropout rate. Because we used a dropout rate of 50%, the real amount of usable information can be effectively halved in each hidden layer (given that there is no space or resources to make the information denser). Together it is a 4x delay, which is not enough to explain the findings (the optimum size of the model is 2^4 vs. 2^7). Possible other reasons could be

• the difficulty of encoding uncertain patterns,
• undertraining due to early stopping.

Figure 3 (violin plots omitted): The second and third experiment analyse the effect of model size on regularized and non-regularized models; panel (a) shows distributions of test losses for non-regularized models, panel (b) for Dropout and Alpha Dropout models combined together. The regularisation delays over-training but does not improve the results.

Table 4: The second experiment compares the L1 loss for models of different sizes without regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            4     8     16    32    64    128   256   512   1024  2048  4096
    4       -     580   518   604   683   767   863   975   1047  1120  1177
    8       825   -     636   707   777   827   921   1045  1129  1213  1238
    16      887   769   -     797   859   871   978   1096  1184  1253  1275
    32      801   698   608   -     818   878   999   1122  1197  1255  1270
    64      722   628   546   587   -     885   1046  1148  1222  1268  1278
    128     638   578   534   527   520   -     1031  1164  1229  1263  1282
    256     542   484   427   406   359   374   -     999   1151  1189  1232
    512     430   360   309   283   257   241   406   -     1042  1075  1187
    1024    358   276   221   208   183   176   254   363   -     961   1072
    2048    285   192   152   150   137   142   216   330   444   -     973
    4096    228   167   130   135   127   123   173   218   333   432   -

Table 5: The third experiment compares the L1 loss for models of different sizes with Dropout or Alpha Dropout regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            4     8     16    32    64    128   256   512   1024  2048  4096
    4       -     1170  948   737   717   795   870   1030  1285  1682  2087
    8       1640  -     1089  802   759   849   926   1088  1412  1855  2197
    16      1862  1721  -     982   892   932   1024  1210  1654  2065  2373
    32      2073  2008  1828  -     1268  1255  1247  1490  1967  2335  2542
    64      2093  2051  1918  1542  -     1347  1355  1620  2111  2438  2588
    128     2015  1961  1878  1555  1463  -     1406  1740  2193  2453  2614
    256     1940  1884  1786  1563  1455  1404  -     1792  2280  2514  2623
    512     1780  1722  1600  1320  1190  1070  1018  -     2127  2417  2585
    1024    1525  1398  1156  843   699   617   530   683   -     2083  2411
    2048    1128  955   745   475   372   357   296   393   727   -     2050
    4096    723   613   437   268   222   196   187   225   399   760   -

4.3 Loss function

In the fourth experiment, we analyzed the effect of the loss function for models without regularisation. Trimmed variants performed poorly, probably because they remove some residuals (10%) and, therefore, reduce the dataset size even more. In our case, the Mean Squared Error (MSE) is a better fit than the Mean Absolute Error (MAE). From the distribution in Figure 4 it seems that MSE has much worse results, but the median value (shown as the white point in the central part of the graph) of MSE is better than that of MAE. The best loss function is the Huber loss. All results can be seen in Table 6.

The Huber loss combines the benefits of both worlds because its derivatives are dependent on the size of the error (from MSE) while limiting the maximum value (from MAE). This effect may be responsible for the best result among the considered loss functions.

Figure 4 (violin plots omitted): The results of the fourth experiment – comparison between loss functions for non-regularized models. MSE has a lot of good and bad results; Huber seems to be the best; trimmed losses are the worst.

Table 6: The fourth experiment compares L1 losses for models trained by optimization of different loss functions without regularization. The table displays the number of experiments where one loss function (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            Huber   MAE    MSE    LTA    LTS
    Huber   -       1854   1614   2111   2184
    MAE     1237    -      1341   2125   2171
    MSE     1477    1750   -      2006   2085
    LTA     980     966    1085   -      1750
    LTS     907     920    1006   1341   -

4.4 Weight regularisation

The weight regularisation has a prominent effect on the results, as revealed in Figure 5. Too much is certainly worse than no weight regularisation, but suitable values significantly reduce bad results.

If the regularisation is too high, the loss is effectively replaced only with the term that reduces the weights on the network's connections. If it is too low, the network can lack regularisation – creating potentially volatile responses.

Figure 5 (violin plots omitted): The fifth experiment exposes the effect of the L2 normalisation weight.

Table 7: The fifth experiment exposes the effect of the L2 normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.0    0.001  0.01   0.1    1.0    10.0
    0.0     -      4430   3705   5053   6260   8348
    0.001   5686   -      4364   5735   6987   8967
    0.01    6411   5752   -      7086   8278   9601
    0.1     5063   4381   3030   -      8325   9686
    1.0     3856   3129   1838   1790   -      9449
    10.0    1768   1149   515    430    647    -

4.5 Activity regularisation

In our case, the effect of activity regularisation is similar to, but smaller than, that of the weight penalty. The difference in weight and activity regularisation effectiveness can be explained by the specific activation used in training. The results are in Figure 6 and in Table 8.

Figure 6 (violin plots omitted): The sixth experiment exposes the effect of the L2 activity normalisation weight.

Table 8: The sixth experiment exposes the effect of the L2 activity normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.0    0.001  0.01   0.1    1.0    10.0
    0.0     -      5197   4895   4813   5787   7044
    0.001   4919   -      4921   4901   5833   7115
    0.01    5221   5195   -      5237   6352   7654
    0.1     5303   5215   4879   -      7313   8509
    1.0     4329   4283   3764   2803   -      8631
    10.0    3072   3001   2462   1607   1485   -

4.6 Alpha Dropout rate

In Figure 7 and Table 9 the effects of the Alpha Dropout rate can be seen. It may be good to investigate smaller values more because the 0.1 rate is the best. The preference for not having this regularisation can be explained in the same way as in Subsection 4.1.

Figure 7 (violin plots omitted): The seventh experiment exposes the effect of the Alpha Dropout rate (probability).

Table 9: The seventh experiment exposes the effect of the Alpha Dropout rate (probability). The table displays the number of experiments where one rate of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.

            0.1    0.2    0.4    0.6    0.8
    0.1     -      7226   7654   7614   7275
    0.2     2890   -      7075   6813   6335
    0.4     2462   3041   -      6137   5684
    0.6     2502   3303   3979   -      5449
    0.8     2841   3781   4432   4667   -
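The significance marks in Tables 4-9 come from one-sided Wilcoxon signed-rank tests with Holm correction. As an illustration only, the following sketch (ours, not the authors' evaluation code; the exact bookkeeping in the paper may differ) computes such pairwise decisions with SciPy, assuming a dictionary of paired, ECDF-normalised losses per setting, where lower is better and all arrays are aligned over the same bins.

    # A sketch (ours) of one-sided pairwise Wilcoxon tests with Holm correction.
    import numpy as np
    from scipy.stats import wilcoxon

    def pairwise_wilcoxon_holm(results, alpha=0.05):
        names = list(results)
        pairs, pvals = [], []
        for a in names:
            for b in names:
                if a == b:
                    continue
                # One-sided test: is setting `a` significantly better than `b`?
                _, p = wilcoxon(results[a], results[b], alternative="less")
                pairs.append((a, b))
                pvals.append(p)
        # Holm step-down: compare sorted p-values with alpha / (m - k).
        order = np.argsort(pvals)
        m = len(pvals)
        rejected = {pair: False for pair in pairs}
        for k, idx in enumerate(order):
            if pvals[idx] <= alpha / (m - k):
                rejected[pairs[idx]] = True
            else:
                break  # stop at the first hypothesis that cannot be rejected
        return rejected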
5 Conclusion

In this paper, we analyzed several types of regularisation techniques on databases where effective hyperparameter optimization is not possible due to the lack of samples or the existence of outliers in the database. We showed that Dropout techniques in these scenarios are not a good choice because their results are not stable enough to compete with models without regularisation. The model's size is an essential aspect, and it seems that the optimum has a far bigger number of free parameters than the theoretical number computed using the average across our training databases. The Huber loss function is the best because it does not suffer from the inconsistencies of the MAE or MSE losses. Trimmed variants of loss functions [21] performed poorly here, but they may be better if a particular dataset has more samples than we had. The third best hyperparameter to look for is the weight regularisation – a small weight dramatically reduces the frequency of bad results while keeping the median of the results low.

6 Acknowledgement

The research reported in this paper has been supported by SVV project number 260 575 and partially supported by the Czech Science Foundation (GA ČR) projects 18-18080S and 19-05704S.

Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

References

[1] Steve Lawrence, Clyde Lee Giles, and Ah Chung Tsoi. What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical report, Institute for Advanced Computer Studies, University of Maryland, 1996.
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2009.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[4] Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29-53. Springer, 1996.
[5] Christopher K. I. Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295-301. Morgan Kaufmann, 1997.
[6] Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
[7] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[8] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Occam's razor. Information Processing Letters, 24(6):377-380, 1987.
[9] Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Advances in Neural Information Processing Systems, pages 294-300. MIT Press, 2001.
[10] Jason Brownlee. Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions. Machine Learning Mastery, 2018.
[11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515, 2017.
[12] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
[13] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[14] P. J. Huber. Robust Statistics. Wiley, New York, 2nd edition, 2009.
[15] Frank R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383-393, 1974.
Publisher: Taylor & Francis. The research reported in this paper has been supported [16] Irad Ben-Gal. Outlier detection. In Data mining and knowl- by SVV project number 260 575 and partially supported edge discovery handbook, pages 131–146. Springer, 2005. by the Czech Science Foundation (GA ČR) projects 18- [17] Hancong Liu, Sirish Shah, and Wei Jiang. On-line outlier 18080S and 19-05704S. detection and data cleaning. Computers & Chemical Engi- Computational resources were supplied by the project neering, 28(9):1635–1647, 2004. "e-Infrastruktura CZ" (e-INFRA LM2018140) provided [18] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. within the program Projects of Large Research, Develop- Efficient Algorithms for Mining Outliers from Large Data ment and Innovations Infrastructures. Sets. In Proceedings of the 2000 ACM SIGMOD Inter- national Conference on Management of Data, SIGMOD ’00, pages 427–438, New York, NY, USA, 2000. Associa- tion for Computing Machinery. event-place: Dallas, Texas, USA. [19] Peter J Rousseeuw. Least median of squares regres- sion. Journal of the American statistical association, 79(388):871–880, 1984. Publisher: Taylor & Francis. [20] Peter Rousseeuw and Katrien Van Driessen. Computing LTS regression for large data sets. Data mining and knowl- edge discovery, 12(1):29–45, 2006. Publisher: Springer. [21] J. Kalina and P. Vidnerová. Robust Multilayer Perceptrons: Robust Loss Functions and Their Derivatives. In L. Iliadis, P. Angelov, C. Jayne, and E. Pimenidis, editors, Proceed- ings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference, pages 546–557, Cham, 2020. Springer. event-place: Cham. [22] J. Kalina, A. Neoral, and P. Vidnerová. Effective automatic method selection for nonlinear regression modelling. Inter- national Journal of Neural Systems, 2021. [23] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, 1945. Pub- lisher: JSTOR. [24] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pages 65– 70, 1979. Publisher: JSTOR. [25] Alessio Benavoli, Giorgio Corani, and Francesca Mangili. Should We Really Use Post-Hoc Tests Based on Mean- Ranks? Journal of Machine Learning Research, 17(5):1– 10, 2016.