<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jiří</forename><surname>Tumpach</surname></persName>
							<email>tumpach@cs.cas.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jan</forename><surname>Kalina</surname></persName>
							<email>kalina@cs.cas.cz</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<email>martin@cs.cas.cz</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8980AF111B18F67704FEF510351AF622</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Neural networks are frequently used as regression models. Their training is usually difficult when the model is subject to a small training dataset with numerous outliers.</p><p>This paper investigates the effects of various regularisation techniques that can help with this kind of problem. We analysed the effects of the model size, loss selection, L2 weight regularisation, L2 activity regularisation, Dropout, and Alpha Dropout.</p><p>We collected 30 different datasets, each of which has been split by ten-fold cross-validation. As an evaluation metric, we used cumulative distribution functions (CDFs) of L1 and L2 losses to aggregate results from different datasets without a considerable amount of distortion. Distributions of the metrics are shown, and thorough statistical tests were conducted.</p><p>Surprisingly, the results show that Dropout models are not suited for our objective. The most effective approach is the choice of model size and L2 types of regularisations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Neural networks are nature-inspired regression models increasingly important in machine learning. This type of model excels in predictive power but it has a poor robustness to outliers, if the training dataset has a small number of samples, the target function is complicated, or the network is over-parametrized for the problem <ref type="bibr" target="#b1">[1,</ref><ref type="bibr" target="#b2">2,</ref><ref type="bibr" target="#b3">3]</ref>.</p><p>On the other hand, novel theoretical analyses show different perspective on neural networks. In <ref type="bibr" target="#b4">[4,</ref><ref type="bibr" target="#b5">5]</ref> the authors investigate the effect of priors over weights for infinitely wide single layer neural network and show that a Gaussian prior results in a Gaussian process prior over its functions. The Gaussian process is a smooth non-parametric model well known for its generalisation properties, so it leads to the conjecture that there is no need to avoid overfitting of such a network. That idea was further generalised to two-layer neural networks in <ref type="bibr" target="#b6">[6]</ref> and general deep neural networks in <ref type="bibr" target="#b7">[7]</ref>. Experiments in <ref type="bibr" target="#b7">[7]</ref> show that finitewidth neural networks approach the infinite counterparts through increasing their width. The authors of <ref type="bibr" target="#b7">[7]</ref> further pointed out that Dropout could be an interesting potential improvement.</p><p>In this paper we are interested in these areas where the network should struggle because we intend to use neural networks as approximation for surrogate modeling in black-box optimisation. Surrogate models are local models that estimate an unknown function in order to select better candidates for evaluation and in this way reduce the cost of optimisation.</p><p>That motivated the research reported in this paper, in which we have investigated 5 different regularisation techniques in different configurations on 30 different datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Regularisation</head><p>Regularisation is a broad term used for methods that add some new prior belief <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b3">3]</ref> to a specific machine learning method. The belief should redefine the problem to achieve a solution modified in the sense described by Occam's razor principle <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b9">9]</ref> -more complex hypotheses are less likely than simple hypothesis. For example, the lasso/ridge shrinkage methods in linear regression add a new term that makes small values more suitable as the solution to a problem <ref type="bibr" target="#b2">[2]</ref>.</p><p>The networks size regularisation As with many other regression models, the number of free parameters has critical consequence <ref type="bibr" target="#b3">[3]</ref>. Models with a small number of free parameters can handle only simple relationships, while large models can be more flexible. On the other hand, a large model needs more samples in order to achieve more reliable predictions for all parameters. If that requirement is not met, the regression can over-fit -the model finds some non-sensible but possible relationships in the training dataset, which not valid in general.</p><p>Weight regularisation One possible solution to the free parameters problem is a restriction of parameter domains. In neural networks, it is done using weight regularisation. In fact, the domains of the parameters are unchanged, but the probability of larger values is strongly reduced because of an alternation of an optimisation objective. For example, the L2 type of the weight regularisation adds new term L w2 to the loss of particular network. It is defined as</p><formula xml:id="formula_0">L w2 = ∑ i w 2 i (1)</formula><p>Where w i stands for the value of the i-th parameter. It is usually applied only to weights, not to biases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Activity regularisation</head><p>The third type of regularisation is the activity regularisation <ref type="bibr" target="#b10">[10]</ref>. In short, it penalises big values coming from neurons. The effect may seem similar to weight regularisation, but it may have more potential in cases where the size of a layer is large enough. The reason is that the weighted activities could count up to large numbers in spite of small values of the weights. Activity regularisation is a way of making the input information denser which is a nice property that is commonly utilised in autoencoder-type neural networks <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b10">10]</ref>.</p><p>Dropout Dropout technique essentially mimics the bagging technique, which is regularly used for improving generalisations of multiple models by aggregating the results <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12,</ref><ref type="bibr" target="#b3">3]</ref>.</p><p>When Dropout is applied to a specific layer, the training and testing phases differ. When the model is in the training stage, the results are randomly dropped -replaced with zeros. Therefore the next layer is forced to adapt to this incomplete information. In the testing phase, the random sampling is replaced by a multiplicative constant in order to maintain mean values of activation for the next layer 1 .</p><p>Consequently, it increases the robustness of the model and does not require any other model to train. The main difference compared with bagging is that the models in Dropout are dependent -they share weights. Such a sharing is illustrated in Figure <ref type="figure">1</ref>.</p><p>Alpha Dropout Standard Dropout is suited for rectified linear units because zero is the default value of this activation <ref type="bibr" target="#b11">[11]</ref>. Alpha Dropout is a slight modification for smoother activation functions. It deals not only with the mean, but also with the variance. It is based on maintaining a walking average of neurons' outputs and scaling them accordingly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Loss functions</head><p>A crucial part of any machine learning model selection is the definition of a loss function (prediction error measure, performance measure) <ref type="bibr" target="#b13">[13,</ref><ref type="bibr" target="#b2">2]</ref>. The loss function should be fast, convex and should match the random noise that can be found in the data. Frequently, the Mean Absolute Error (MAE) and Mean Square Error (MSE) functions are selected because the corresponding noise is additive and generated by Laplacian and Gaussian distribution, respectively. In addition, also the Huber loss function (cf. <ref type="bibr" target="#b14">[14]</ref>) is commonly used. These three loss functions are defined 1 An alternative is to use the constant in the training phase. . . . . . . . . .</p><formula xml:id="formula_1">I 1 I 2 I 3 I n H 1 H n</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Input layer</head><p>Hidden layer Hidden layer 2</p><p>Figure <ref type="figure">1</ref>: Dropout regularisation. The red circle depicts the neuron where Dropout regularisation causes the output to be masked -in the current iteration, the output is set to zero, and all following layers are computed as normal. This effect causes the updates of immediate incoming connections of that neuron to be zero, but other updates can still modify all other previous weights through non-dropped neurons. In this case, all red edges represent changes brought by gradients from other neurons that may influence the dropped neuron in the following iterations.</p><p>as:</p><formula xml:id="formula_2">MSE(D) = 1 2|D| ∑ |D| i=1 (y i − ŷi ) 2 (2) MAE(D) = 1 |D| ∑ |D| i=1 |y i − ŷi |<label>(3)</label></formula><formula xml:id="formula_3">Huber(D) = 1 2|D| ∑ |D| i=1 min (y i − ŷi ) 2 , 2|y i − ŷi | − 1 ,<label>(4)</label></formula><p>where D is the dataset on which the loss is calculated, |D| is its size, y i is the target value of the i-th sample, and ŷi is its prediction.</p><p>It is common to assume that the dataset is outlier-free and normally distributed; therefore the MSE is the first choice regarding the selection of the loss function.</p><p>MAE/Huber losses are good replacement whenever the data are known to have outliers, or the MSE has not performed well for an unknown reason.</p><p>Robust loss functions A common way of dealing with outliers is to remove them from training data or choose an entirely different model<ref type="foot" target="#foot_0">2</ref>  <ref type="bibr" target="#b2">[2]</ref>.</p><p>Even though the outlier removal has been thoroughly studied, the exact definition of an outlier highly depends on the problem we want to solve. There exists definitions of an outlier relying on median absolute deviation <ref type="bibr" target="#b15">[15]</ref>, quantile and medoid <ref type="bibr">[16]</ref>, online Kalman filter <ref type="bibr" target="#b17">[17]</ref> or nearest neighbour based filtering <ref type="bibr" target="#b18">[18]</ref>.</p><p>A different approach is proposed in <ref type="bibr" target="#b19">[19]</ref> and improved in <ref type="bibr" target="#b20">[20]</ref> where authors deal with robust linear regression by removing the most prominent residuals in the loss function. That idea was further adapted for neural networks in <ref type="bibr" target="#b21">[21]</ref> or nonlinear regression with a known regression function in <ref type="bibr" target="#b22">[22]</ref>. Essentially, these methods exploit the idea that neural networks can learn algorithms (hypothesis). With an assumption that more complex algorithms are harder to learn, the prior belief that reduces the probability of more complex hypotheses also serves as an outlier removal tool.</p><p>Extensions Least Trimmed Squares (LTS) and Least Trimmed Absolute Deviations (LTA) of MSE (2) and MAE (3) that follow the approach recalled in the previous paragraph and we used them in out analysis are defined in the following way</p><formula xml:id="formula_4">LTS(D) = 1 0.9|D| ∑ |D| i=1 ρ (y i − ŷi ) 2 (5) LTA(D) = 1 0.9|D| ∑ |D| i=1 ρ (|y i − ŷi |) ,<label>(6)</label></formula><p>where ρ(x i ) = x i if less than 90 % of residuals 0 otherwise 3 Methodology</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Datasets and their preparation</head><p>We selected 30 datasets containing a relatively small number of samples. These are real-world as well as artificially generated publicly available datasets, for which a nonlinear regression model (i.e. explaining a given variable as a response against predictors under uncertainty) is a meaningful task. The list of the 30 datasets is presented in Table <ref type="table" target="#tab_0">1</ref>. Only datasets without missing values were selected.</p><p>A ten-fold validation has been employed in order to obtain more reliable results. If the dataset had less than ten samples, we used leave-one-out cross-validation instead.</p><p>Each feature was standardised according to training data in a specific fold.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Aggregation of results</head><p>It is not possible to visualize the results of regression methods across multiple datasets and loss functions. For example, some datasets are easier than others, and one loss function highlights outliers more, so most of the loss is made of one sample. We tackle this problem by separating the specific dataset, fold, and function in a separate bin. In this bin, we learn the order of results creating empirical cumulative distribution function (ECDF). Every result in a specific bin </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Statistical tests</head><p>We have used only non-parametric blocking statistical tests because the results have limited values, and we wanted to utilise as much information as possible. The Friedman test was used to decide whether a particular view on some hyperparameter includes is drawn from the same distribution or not. At this point, the ECDF mapping is not needed because the test is non-parametric. All statistical tests use the usual 5% significance level. Multiple comparison tests were done using Wilcoxon signed-rank test <ref type="bibr" target="#b23">[23]</ref> with Holm correction <ref type="bibr" target="#b24">[24]</ref> instead of mean-ranks post-hoc tests which can create inconsistencies and paradoxical situations in machine learning scenarios <ref type="bibr" target="#b25">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Architecture</head><p>We selected three-layered architecture. The first layer has T neurons, the second layer has always T /2 and the third layer always has one neuron. The first two layers have Scaled Exponential Linear Units (SELU) as an activation function, and the third layer has a linear function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Training</head><p>We trained our models with a NAdam optimiser with a 0.001 learning rate. We use early stopping with patience = 10 and delta = 1e −10 to speed up the training. Even though this is another type of regularisation, we use it in such a manner that its effect is minuscule. The maximal number of epochs is set to 10000, and batch size is equivalent to the size of the largest dataset. In the first set of experiments, we produced 48 014 neural networks and their results. In the second set, we managed to prepare 178 092 models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dropout regularisation</head><p>In the first experiment, we compared between 3 settings -no regularisation, dropout regularisation and, alpha dropout regularisation. Both dropout techniques are set to 50% probability. The results are in Figure <ref type="figure" target="#fig_2">2</ref>, number of models that are better than the same hyperparameter counterpart can be seen in Table <ref type="table" target="#tab_1">2b</ref> for L1 loss and Table <ref type="table" target="#tab_1">2c</ref> for L2 loss. We highlighted in bold values that Wilcoxon signed-rank test with Holm correction found significantly better than the column value.</p><p>It seems that the regularisation does not help. It may be caused by the exaggerated value of the Dropout rate or a need for such models to have wider layers. We do not know the reason why Alpha Dropout performed so badlyit should be better because we used SELU as an activation function.</p><p>One possible explanation for this poor performance is the use of early stopping. In the training phase, the dropout causes the output to be stochastic, so the error is stochastic too. The stochastic error can cause accidental results, which can stop the training prematurely. Because we have one mini-batch, the variance is too significant not to be perceptible.</p><p>Though not tried in our experiments, a possible remedy could be early stopping variation where the error is exponentially smoothed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Size of models</head><p>In the second and third experiments, we were interested in network size and its effect on performance. Figure <ref type="figure" target="#fig_4">3a</ref> and Table <ref type="table">4</ref> show non-regularised models and Figure <ref type="figure" target="#fig_4">3b</ref> and Table <ref type="table" target="#tab_3">5</ref> show the Dropout variants combined together. Non-regularized results are better than the Dropout variants, which are less stable and have delayed response on the increase of network size.</p><p>The stability may come from the same source as the previous problem -the early stopping could make the model undertrained. The delay may be the result of the selected dropout rate. Because we used a dropout rate of 50%, the real amount of usable information can be effectively halved in each hidden layer (given that there is no space or resources to make the information denser). Together it is a 4x delay which is not enough to explain the findings (the optimum size of the model is 2 4 vs 2 7 ). Possible other reasons could be • the difficulty of encoding uncertain patterns • undertraining, due to early stopping</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Loss function</head><p>In the fourth experiment, we analyzed the effect of a loss function for models without regularisation. Trimmed variants performed poorly probably because they remove some residuals (10%) and, therefore, reduce dataset size even more. In our case, Mean Squared Error (MSE) is better fitted than Mean Average Error (MAE). From the distribution in Figure <ref type="figure" target="#fig_5">4</ref> it seems that MSE has much worse results, but the median value (shown as the white point in the central part of the graph) of MSE is better than that of MAE. The best loss function is the Huber loss. All results can be seen in Table <ref type="table" target="#tab_4">6</ref>.</p><p>The Huber loss combines benefits of both worlds because its derivatives are dependent on the size of error (from MSE) while limiting the maximum value (from MAE). This effect may be responsible for the best result among the considered loss functions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Weight regularisation</head><p>The weight regularisation has a prominent effect on the results, as revealed in Figure <ref type="figure" target="#fig_6">5</ref>. Too much is certainly worse than no weight normalisation, but suitable values significantly reduce bad results.           If the regularisation is too high, the loss is effectively replaced only with the term that reduces weights on the network's connections. If it is too low, the network can lack regularisation -creating potentially volatile responses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Activity regularisation</head><p>In our case, the effect of activity regularisation is similar but smaller than the weight penalty. The difference in weight and activity regularisation effectiveness can be explained by the specific activation used in training. The results are in Figure <ref type="figure" target="#fig_7">6</ref> and in Table <ref type="table" target="#tab_6">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Alpha Dropout rate</head><p>In Figure <ref type="figure" target="#fig_8">7</ref> and Table <ref type="table" target="#tab_7">9</ref> the effects of Alpha Dropout rate can be seen. It may be good to investigate smaller values more because the 0.1 rate is the best. The preference for not having this regularisation can be explained equally as in the subsection 4.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we analyzed several types of regularisation techniques on databases where effective hyperparameter optimization is not possible due to the lack of samples or the existence of outliers in the database. We showed that Dropout techniques in these scenarios are not a good choice because their results are not stable enough to compete with models without regularisation. The model's size is an essential aspect, and it seems that the optimum has a far bigger number of free parameters than the theoretical number computed using the average across our training databases. Huber loss function is the best because it does not suffer from inconsistencies of MAE or MSE losses. Trimmed variants of loss functions <ref type="bibr" target="#b21">[21]</ref> performed poorly here, but they may be better if a particular dataset has more samples than we had. The third best hyperparameter to look for is the weight normalization -small weight dramatically reduces the frequency of bad results while keeping the median of results low.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Acknowledgement</head><p>The research reported in this paper has been supported by SVV project number 260 575 and partially supported by the Czech Science Foundation (GA ČR) projects 18-18080S and 19-05704S.</p><p>Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a) Distributions of scaled results by empirical cumulative distribution functions for each dataset and loss separately.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Statistical tests for L2 loss function</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The first experiment compared different kinds of dropout regularizers across all different combinations of datasets and hyperparameters. The tables display the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold. The statistical tests clearly prefer models without dropout.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Distributions of test losses for non-regularized models. (b) Distributions of test losses for Dropout and Alpha Dropout models combined together.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The second and third experiment analyses the effect of model size on regularized and non-regularized models. The regularisation delays over-training but does not improve the results.</figDesc><graphic coords="6,56.69,535.13,481.90,189.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The results of the fourth experiment -comparison between loss functions for non-regularized models. MSE has a lot of good and bad results; Huber seems to be the best; trimmed losses are the worst.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The fifth experiment exposes the effect of L2 normalisation weight.</figDesc><graphic coords="7,56.69,88.94,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: The sixth experiment exposes the effect of L2 activity normalisation weight.</figDesc><graphic coords="7,56.69,318.37,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: The seventh experiment exposes the effect of Alpha Dropout rate (probability).</figDesc><graphic coords="7,56.69,547.80,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>All datasets considered in the analysis.</figDesc><table><row><cell>Name</cell><cell>no. features</cell><cell>no. samples</cell></row><row><cell cols="2">Concrete Compressive Strength</cell><cell>1030</cell></row><row><cell>The Boston Housing</cell><cell></cell><cell>506</cell></row><row><cell>Auto MPG</cell><cell></cell><cell>398</cell></row><row><cell>Proben1 (3d reg.)</cell><cell></cell><cell>4208</cell></row><row><cell>Misra1a</cell><cell></cell><cell>14</cell></row><row><cell>Chwirut2</cell><cell></cell><cell>54</cell></row><row><cell>Chwirut1</cell><cell></cell><cell>214</cell></row><row><cell>Lanczos3</cell><cell></cell><cell>24</cell></row><row><cell>Gauss1</cell><cell></cell><cell>250</cell></row><row><cell>Gauss2</cell><cell></cell><cell>250</cell></row><row><cell>DanWood</cell><cell></cell><cell>6</cell></row><row><cell>Kirby2</cell><cell></cell><cell>151</cell></row><row><cell>Hahn1</cell><cell></cell><cell>236</cell></row><row><cell>Nelson</cell><cell></cell><cell>128</cell></row><row><cell>MGH17</cell><cell></cell><cell>33</cell></row><row><cell>Lanczos1</cell><cell></cell><cell>24</cell></row><row><cell>Lanczos2</cell><cell></cell><cell>24</cell></row><row><cell>Gauss3</cell><cell></cell><cell>250</cell></row><row><cell>Roszman1</cell><cell></cell><cell>25</cell></row><row><cell>ENSO</cell><cell></cell><cell>168</cell></row><row><cell>MGH09</cell><cell></cell><cell>11</cell></row><row><cell>Thurber</cell><cell></cell><cell>37</cell></row><row><cell>BoxBOD</cell><cell></cell><cell>6</cell></row><row><cell>Rat42</cell><cell></cell><cell>9</cell></row><row><cell>MGH10</cell><cell></cell><cell>16</cell></row><row><cell>Eckerle4</cell><cell></cell><cell>35</cell></row><row><cell>Rat43</cell><cell></cell><cell>15</cell></row><row><cell>Bennett5</cell><cell></cell><cell>154</cell></row><row><cell>Hyperparameter</cell><cell cols="2">Considered values</cell></row><row><cell>Loss</cell><cell cols="2">MSE, MAE, Huber, LTS, LTA</cell></row><row><cell cols="3">Size of the first layer 4, 8, 16, 32, 64, 128, 256, 512,</cell></row><row><cell></cell><cell cols="2">1024, 2048, 4096</cell></row><row><cell cols="3">Model regularisation No regularisation,</cell></row><row><cell></cell><cell>50% Dropout,</cell><cell></cell></row><row><cell></cell><cell cols="2">50% Alpha Dropout</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Options for the first set of experiments.was mapped by the corresponding ECDF, creating normalized order of results in a particular bin. Finally, all results are combined back together.To compare normalized results for a specific hyperparameter, we split combined results by the value. These splits create empirical distributions of normalized results, which a violin plot can reasonably visualize.</figDesc><table><row><cell>Hyperparameter</cell><cell>Considered values</cell></row><row><cell>Loss</cell><cell>MSE, MAE, Huber</cell></row><row><cell cols="2">Size of the first layer 4, 8, 16, 32, 64, 128, 256, 512,</cell></row><row><cell></cell><cell>1024, 2048, 4096, 8192</cell></row><row><cell></cell><cell>L2-weight -0.001, 0.01, 0.1,</cell></row><row><cell>Model regularisation</cell><cell>1, 10</cell></row><row><cell></cell><cell>L2-activity -0.001, 0.01, 0.1,</cell></row><row><cell></cell><cell>1, 10</cell></row><row><cell></cell><cell>Alpha dropout -0.1, 0.2, 0.4,</cell></row><row><cell></cell><cell>0.6, 0.8</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Options for the second set of experiments.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5 :</head><label>5</label><figDesc>The third experiment compares L1 loss for models of different sizes with Dropout or Alpha Dropout regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6 :</head><label>6</label><figDesc>The fourth experiment compares L1 losses for models trained by optimization of different loss functions without regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell></cell><cell></cell><cell>Huber</cell><cell>MAE</cell><cell>MSE</cell><cell>LTA</cell><cell>LTS</cell></row><row><cell cols="2">Huber</cell><cell></cell><cell cols="4">1854 1614 2111 2184</cell></row><row><cell cols="3">MAE 1237</cell><cell></cell><cell cols="3">1341 2125 2171</cell></row><row><cell cols="5">MSE 1477 1750</cell><cell cols="2">2006 2085</cell></row><row><cell></cell><cell>LTA</cell><cell>980</cell><cell cols="2">966 1085</cell><cell></cell><cell>1750</cell></row><row><cell></cell><cell>LTS</cell><cell>907</cell><cell cols="3">920 1006 1341</cell></row><row><cell></cell><cell cols="5">0.0 0.001 0.01 0.1</cell><cell>1.0 10.0</cell></row><row><cell>0.0</cell><cell></cell><cell cols="2">4430</cell><cell cols="3">3705 5053 6260 8348</cell></row><row><cell cols="2">0.001 5686</cell><cell></cell><cell></cell><cell cols="3">4364 5735 6987 8967</cell></row><row><cell>0.01</cell><cell>6411</cell><cell cols="2">5752</cell><cell cols="3">7086 8278 9601</cell></row><row><cell>0.1</cell><cell>5063</cell><cell cols="2">4381</cell><cell>3030</cell><cell cols="2">8325 9686</cell></row><row><cell>1.0</cell><cell>3856</cell><cell cols="2">3129</cell><cell cols="2">1838 1790</cell><cell>9449</cell></row><row><cell>10.0</cell><cell>1768</cell><cell cols="2">1149</cell><cell>515</cell><cell>430</cell><cell>647</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7 :</head><label>7</label><figDesc>The fifth experiment exposes the effect of L2 normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell></cell><cell cols="3">0.0 0.001 0.01 0.1</cell><cell>1.0 10.0</cell></row><row><cell>0.0</cell><cell></cell><cell>5197</cell><cell cols="2">4895 4813 5787 7044</cell></row><row><cell cols="2">0.001 4919</cell><cell></cell><cell cols="2">4921 4901 5833 7115</cell></row><row><cell>0.01</cell><cell>5221</cell><cell>5195</cell><cell cols="2">5237 6352 7654</cell></row><row><cell>0.1</cell><cell>5303</cell><cell>5215</cell><cell>4879</cell><cell>7313 8509</cell></row><row><cell>1.0</cell><cell>4329</cell><cell>4283</cell><cell>3764 2803</cell><cell>8631</cell></row><row><cell>10.0</cell><cell>3072</cell><cell>3001</cell><cell cols="2">2462 1607 1485</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8 :</head><label>8</label><figDesc>The sixth experiment exposes the effect of L2 activity normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell>0.1</cell><cell>0.2</cell><cell>0.4</cell><cell>0.6</cell><cell>0.8</cell></row><row><cell>0.1</cell><cell cols="4">7226 7654 7614 7275</cell></row><row><cell>0.2 2890</cell><cell></cell><cell cols="3">7075 6813 6335</cell></row><row><cell cols="2">0.4 2462 3041</cell><cell></cell><cell cols="2">6137 5684</cell></row><row><cell cols="3">0.6 2502 3303 3979</cell><cell></cell><cell>5449</cell></row><row><cell cols="4">0.8 2841 3781 4432 4667</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 9 :</head><label>9</label><figDesc>The seventh experiment exposes the effect of Alpha Dropout rate (probability). The table displays the number of experiments where one rate of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">Like k-NN or regression tree.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The second experiment compares L1 loss for models of different sizes without regularization. The table displays the number of experiments where one model size (row)</title>
		<idno>1030 1285 1682 2087 8 1640 1089 802 759 849 926 1088 1412 1855 2197 16 1862 1721 982 892 932 1024 1210 1654 2065 2373 32 2073 2008 1828 1268 1255 1247 1490 1967 2335 2542 64 2093 2051 1918 1542 1347 1355 1620 2111 2438 2588 128 2015 1961 1878 1555 1463 1406 1740 2193 2453 2614 256 1940 1884 1786 1563 1455 1404 1792 2280 2514 2623 512 1780 1722 1600 1320 1190 1070 1018 2127 2417 2585 1024 1525 1398 1156 843 699 617 530 683 2083 2411 2048 1128</idno>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="955" to="745" />
		</imprint>
	</monogr>
	<note>). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">What size neural network gives optimal generalization? convergence properties of backpropagation</title>
		<author>
			<persName><forename type="first">Steve</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clyde</forename><forename type="middle">Lee</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ah</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tsoi</forename></persName>
		</author>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
		<respStmt>
			<orgName>Institute for Advanced Computer Studies University of Maryla</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The elements of statistical learning: data mining, inference and prediction</title>
		<author>
			<persName><forename type="first">Trevor</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jerome</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
	<note>2 edition</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Deep Learning</title>
		<author>
			<persName><forename type="first">Ian</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Courville</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Priors for infinite networks</title>
		<author>
			<persName><forename type="first">Radford</forename><forename type="middle">M</forename><surname>Neal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Bayesian Learning for Neural Networks</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="29" to="53" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Computing with infinite networks</title>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<publisher>Morgan Kaufmann Publishers</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="295" to="301" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Steps toward deep kernel methods from infinite neural networks</title>
		<author>
			<persName><forename type="first">Tamir</forename><surname>Hazan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tommi</forename><surname>Jaakkola</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.05133</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Jaehoon</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yasaman</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><forename type="middle">S</forename><surname>Schoenholz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jascha</forename><surname>Sohl-Dickstein</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.00165</idno>
		<title level="m">Deep Neural Networks as Gaussian Processes</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Occam&apos;s razor</title>
		<author>
			<persName><forename type="first">Anselm</forename><surname>Blumer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrzej</forename><surname>Ehrenfeucht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Haussler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manfred</forename><forename type="middle">K</forename><surname>Warmuth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information processing letters</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="377" to="380" />
			<date type="published" when="1987">1987</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Occam&apos;s razor</title>
		<author>
			<persName><forename type="first">Carl</forename><forename type="middle">Edward</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoubin</forename><surname>Ghahramani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<publisher>MIT</publisher>
			<date type="published" when="1998">2001. 1998</date>
			<biblScope unit="page" from="294" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions</title>
		<author>
			<persName><forename type="first">Jason</forename><surname>Brownlee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning Mastery</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Self-Normalizing Neural Networks</title>
		<author>
			<persName><forename type="first">Günter</forename><surname>Klambauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Mayr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sepp</forename><surname>Hochreiter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.02515</idno>
		<idno>arXiv: 1706.02515</idno>
		<imprint>
			<date type="published" when="2017-09">September 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</title>
		<author>
			<persName><forename type="first">Nitish</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014-06">June 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond</title>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Scholkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Robust statistics</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Huber</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Wiley</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
	<note>2nd edition</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The influence curve and its role in robust estimation</title>
		<author>
			<persName><forename type="first">Frank</forename><forename type="middle">R</forename><surname>Hampel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the american statistical association</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">346</biblScope>
			<biblScope unit="page" from="383" to="393" />
			<date type="published" when="1974">1974</date>
			<publisher>Taylor &amp; Francis</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Outlier detection</title>
		<author>
			<persName><forename type="first">Irad</forename><surname>Ben-Gal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data mining and knowledge discovery handbook</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="131" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">On-line outlier detection and data cleaning</title>
		<author>
			<persName><forename type="first">Hancong</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sirish</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Chemical Engineering</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1635" to="1647" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Efficient Algorithms for Mining Outliers from Large Data Sets</title>
		<author>
			<persName><forename type="first">Sridhar</forename><surname>Ramaswamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyuseok</forename><surname>Shim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;00</title>
				<meeting>the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;00<address><addrLine>New York, NY, USA; Dallas, Texas, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="427" to="438" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Least median of squares regression</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Rousseeuw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American statistical association</title>
		<imprint>
			<biblScope unit="volume">79</biblScope>
			<biblScope unit="issue">388</biblScope>
			<biblScope unit="page" from="871" to="880" />
			<date type="published" when="1984">1984</date>
			<publisher>Taylor &amp; Francis</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Computing LTS regression for large data sets</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Rousseeuw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katrien</forename><surname>Van Driessen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data mining and knowledge discovery</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="29" to="45" />
			<date type="published" when="2006">2006</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Robust Multilayer Perceptrons: Robust Loss Functions and Their Derivatives</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vidnerová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Iliadis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Angelov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Jayne</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Pimenidis</surname></persName>
		</editor>
		<meeting>the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference<address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Cham</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="546" to="557" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Effective automatic method selection for nonlinear regression modelling</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neoral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vidnerová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Neural Systems</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Individual Comparisons by Ranking Methods</title>
		<author>
			<persName><forename type="first">Frank</forename><surname>Wilcoxon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics Bulletin</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="80" to="83" />
			<date type="published" when="1945">1945</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A simple sequentially rejective multiple test procedure</title>
		<author>
			<persName><forename type="first">Sture</forename><surname>Holm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scandinavian journal of statistics</title>
		<imprint>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="1979">1979</date>
			<publisher>JSTOR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Should We Really Use Post-Hoc Tests Based on Mean-Ranks</title>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Benavoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Corani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesca</forename><surname>Mangili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
