<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jiří</forename><surname>Tumpach</surname></persName>
							<email>tumpach@cs.cas.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jan</forename><surname>Kalina</surname></persName>
							<email>kalina@cs.cas.cz</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<email>martin@cs.cas.cz</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Comparison of Regularization Techniques for Shallow Neural Networks Trained on Small Datasets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8980AF111B18F67704FEF510351AF622</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Neural networks are frequently used as regression models. Their training is usually difficult when the model is subject to a small training dataset with numerous outliers.</p><p>This paper investigates the effects of various regularisation techniques that can help with this kind of problem. We analysed the effects of the model size, loss selection, L2 weight regularisation, L2 activity regularisation, Dropout, and Alpha Dropout.</p><p>We collected 30 different datasets, each of which has been split by ten-fold cross-validation. As an evaluation metric, we used cumulative distribution functions (CDFs) of L1 and L2 losses to aggregate results from different datasets without a considerable amount of distortion. Distributions of the metrics are shown, and thorough statistical tests were conducted.</p><p>Surprisingly, the results show that Dropout models are not suited for our objective. The most effective approach is the choice of model size and L2 types of regularisations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Neural networks are nature-inspired regression models increasingly important in machine learning. This type of model excels in predictive power but it has a poor robustness to outliers, if the training dataset has a small number of samples, the target function is complicated, or the network is over-parametrized for the problem <ref type="bibr" target="#b1">[1,</ref><ref type="bibr" target="#b2">2,</ref><ref type="bibr" target="#b3">3]</ref>.</p><p>On the other hand, novel theoretical analyses show different perspective on neural networks. In <ref type="bibr" target="#b4">[4,</ref><ref type="bibr" target="#b5">5]</ref> the authors investigate the effect of priors over weights for infinitely wide single layer neural network and show that a Gaussian prior results in a Gaussian process prior over its functions. The Gaussian process is a smooth non-parametric model well known for its generalisation properties, so it leads to the conjecture that there is no need to avoid overfitting of such a network. That idea was further generalised to two-layer neural networks in <ref type="bibr" target="#b6">[6]</ref> and general deep neural networks in <ref type="bibr" target="#b7">[7]</ref>. Experiments in <ref type="bibr" target="#b7">[7]</ref> show that finitewidth neural networks approach the infinite counterparts through increasing their width. The authors of <ref type="bibr" target="#b7">[7]</ref> further pointed out that Dropout could be an interesting potential improvement.</p><p>In this paper we are interested in these areas where the network should struggle because we intend to use neural networks as approximation for surrogate modeling in black-box optimisation. Surrogate models are local models that estimate an unknown function in order to select better candidates for evaluation and in this way reduce the cost of optimisation.</p><p>That motivated the research reported in this paper, in which we have investigated 5 different regularisation techniques in different configurations on 30 different datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Regularisation</head><p>Regularisation is a broad term used for methods that add some new prior belief <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b3">3]</ref> to a specific machine learning method. The belief should redefine the problem to achieve a solution modified in the sense described by Occam's razor principle <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b9">9]</ref> -more complex hypotheses are less likely than simple hypothesis. For example, the lasso/ridge shrinkage methods in linear regression add a new term that makes small values more suitable as the solution to a problem <ref type="bibr" target="#b2">[2]</ref>.</p><p>The networks size regularisation As with many other regression models, the number of free parameters has critical consequence <ref type="bibr" target="#b3">[3]</ref>. Models with a small number of free parameters can handle only simple relationships, while large models can be more flexible. On the other hand, a large model needs more samples in order to achieve more reliable predictions for all parameters. If that requirement is not met, the regression can over-fit -the model finds some non-sensible but possible relationships in the training dataset, which not valid in general.</p><p>Weight regularisation One possible solution to the free parameters problem is a restriction of parameter domains. In neural networks, it is done using weight regularisation. In fact, the domains of the parameters are unchanged, but the probability of larger values is strongly reduced because of an alternation of an optimisation objective. For example, the L2 type of the weight regularisation adds new term L w2 to the loss of particular network. It is defined as</p><formula xml:id="formula_0">L w2 = ∑ i w 2 i (1)</formula><p>Where w i stands for the value of the i-th parameter. It is usually applied only to weights, not to biases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Activity regularisation</head><p>The third type of regularisation is the activity regularisation <ref type="bibr" target="#b10">[10]</ref>. In short, it penalises big values coming from neurons. The effect may seem similar to weight regularisation, but it may have more potential in cases where the size of a layer is large enough. The reason is that the weighted activities could count up to large numbers in spite of small values of the weights. Activity regularisation is a way of making the input information denser which is a nice property that is commonly utilised in autoencoder-type neural networks <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b10">10]</ref>.</p><p>Dropout Dropout technique essentially mimics the bagging technique, which is regularly used for improving generalisations of multiple models by aggregating the results <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12,</ref><ref type="bibr" target="#b3">3]</ref>.</p><p>When Dropout is applied to a specific layer, the training and testing phases differ. When the model is in the training stage, the results are randomly dropped -replaced with zeros. Therefore the next layer is forced to adapt to this incomplete information. In the testing phase, the random sampling is replaced by a multiplicative constant in order to maintain mean values of activation for the next layer 1 .</p><p>Consequently, it increases the robustness of the model and does not require any other model to train. The main difference compared with bagging is that the models in Dropout are dependent -they share weights. Such a sharing is illustrated in Figure <ref type="figure">1</ref>.</p><p>Alpha Dropout Standard Dropout is suited for rectified linear units because zero is the default value of this activation <ref type="bibr" target="#b11">[11]</ref>. Alpha Dropout is a slight modification for smoother activation functions. It deals not only with the mean, but also with the variance. It is based on maintaining a walking average of neurons' outputs and scaling them accordingly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Loss functions</head><p>A crucial part of any machine learning model selection is the definition of a loss function (prediction error measure, performance measure) <ref type="bibr" target="#b13">[13,</ref><ref type="bibr" target="#b2">2]</ref>. The loss function should be fast, convex and should match the random noise that can be found in the data. Frequently, the Mean Absolute Error (MAE) and Mean Square Error (MSE) functions are selected because the corresponding noise is additive and generated by Laplacian and Gaussian distribution, respectively. In addition, also the Huber loss function (cf. <ref type="bibr" target="#b14">[14]</ref>) is commonly used. These three loss functions are defined 1 An alternative is to use the constant in the training phase. . . . . . . . . .</p><formula xml:id="formula_1">I 1 I 2 I 3 I n H 1 H n</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Input layer</head><p>Hidden layer Hidden layer 2</p><p>Figure <ref type="figure">1</ref>: Dropout regularisation. The red circle depicts the neuron where Dropout regularisation causes the output to be masked -in the current iteration, the output is set to zero, and all following layers are computed as normal. This effect causes the updates of immediate incoming connections of that neuron to be zero, but other updates can still modify all other previous weights through non-dropped neurons. In this case, all red edges represent changes brought by gradients from other neurons that may influence the dropped neuron in the following iterations.</p><p>as:</p><formula xml:id="formula_2">MSE(D) = 1 2|D| ∑ |D| i=1 (y i − ŷi ) 2 (2) MAE(D) = 1 |D| ∑ |D| i=1 |y i − ŷi |<label>(3)</label></formula><formula xml:id="formula_3">Huber(D) = 1 2|D| ∑ |D| i=1 min (y i − ŷi ) 2 , 2|y i − ŷi | − 1 ,<label>(4)</label></formula><p>where D is the dataset on which the loss is calculated, |D| is its size, y i is the target value of the i-th sample, and ŷi is its prediction.</p><p>It is common to assume that the dataset is outlier-free and normally distributed; therefore the MSE is the first choice regarding the selection of the loss function.</p><p>MAE/Huber losses are good replacement whenever the data are known to have outliers, or the MSE has not performed well for an unknown reason.</p><p>Robust loss functions A common way of dealing with outliers is to remove them from training data or choose an entirely different model<ref type="foot" target="#foot_0">2</ref>  <ref type="bibr" target="#b2">[2]</ref>.</p><p>Even though the outlier removal has been thoroughly studied, the exact definition of an outlier highly depends on the problem we want to solve. There exists definitions of an outlier relying on median absolute deviation <ref type="bibr" target="#b15">[15]</ref>, quantile and medoid <ref type="bibr">[16]</ref>, online Kalman filter <ref type="bibr" target="#b17">[17]</ref> or nearest neighbour based filtering <ref type="bibr" target="#b18">[18]</ref>.</p><p>A different approach is proposed in <ref type="bibr" target="#b19">[19]</ref> and improved in <ref type="bibr" target="#b20">[20]</ref> where authors deal with robust linear regression by removing the most prominent residuals in the loss function. That idea was further adapted for neural networks in <ref type="bibr" target="#b21">[21]</ref> or nonlinear regression with a known regression function in <ref type="bibr" target="#b22">[22]</ref>. Essentially, these methods exploit the idea that neural networks can learn algorithms (hypothesis). With an assumption that more complex algorithms are harder to learn, the prior belief that reduces the probability of more complex hypotheses also serves as an outlier removal tool.</p><p>Extensions Least Trimmed Squares (LTS) and Least Trimmed Absolute Deviations (LTA) of MSE (2) and MAE (3) that follow the approach recalled in the previous paragraph and we used them in out analysis are defined in the following way</p><formula xml:id="formula_4">LTS(D) = 1 0.9|D| ∑ |D| i=1 ρ (y i − ŷi ) 2 (5) LTA(D) = 1 0.9|D| ∑ |D| i=1 ρ (|y i − ŷi |) ,<label>(6)</label></formula><p>where ρ(x i ) = x i if less than 90 % of residuals 0 otherwise 3 Methodology</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Datasets and their preparation</head><p>We selected 30 datasets containing a relatively small number of samples. These are real-world as well as artificially generated publicly available datasets, for which a nonlinear regression model (i.e. explaining a given variable as a response against predictors under uncertainty) is a meaningful task. The list of the 30 datasets is presented in Table <ref type="table" target="#tab_0">1</ref>. Only datasets without missing values were selected.</p><p>A ten-fold validation has been employed in order to obtain more reliable results. If the dataset had less than ten samples, we used leave-one-out cross-validation instead.</p><p>Each feature was standardised according to training data in a specific fold.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Aggregation of results</head><p>It is not possible to visualize the results of regression methods across multiple datasets and loss functions. For example, some datasets are easier than others, and one loss function highlights outliers more, so most of the loss is made of one sample. We tackle this problem by separating the specific dataset, fold, and function in a separate bin. In this bin, we learn the order of results creating empirical cumulative distribution function (ECDF). Every result in a specific bin </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Statistical tests</head><p>We have used only non-parametric blocking statistical tests because the results have limited values, and we wanted to utilise as much information as possible. The Friedman test was used to decide whether a particular view on some hyperparameter includes is drawn from the same distribution or not. At this point, the ECDF mapping is not needed because the test is non-parametric. All statistical tests use the usual 5% significance level. Multiple comparison tests were done using Wilcoxon signed-rank test <ref type="bibr" target="#b23">[23]</ref> with Holm correction <ref type="bibr" target="#b24">[24]</ref> instead of mean-ranks post-hoc tests which can create inconsistencies and paradoxical situations in machine learning scenarios <ref type="bibr" target="#b25">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Architecture</head><p>We selected three-layered architecture. The first layer has T neurons, the second layer has always T /2 and the third layer always has one neuron. The first two layers have Scaled Exponential Linear Units (SELU) as an activation function, and the third layer has a linear function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Training</head><p>We trained our models with a NAdam optimiser with a 0.001 learning rate. We use early stopping with patience = 10 and delta = 1e −10 to speed up the training. Even though this is another type of regularisation, we use it in such a manner that its effect is minuscule. The maximal number of epochs is set to 10000, and batch size is equivalent to the size of the largest dataset. In the first set of experiments, we produced 48 014 neural networks and their results. In the second set, we managed to prepare 178 092 models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dropout regularisation</head><p>In the first experiment, we compared between 3 settings -no regularisation, dropout regularisation and, alpha dropout regularisation. Both dropout techniques are set to 50% probability. The results are in Figure <ref type="figure" target="#fig_2">2</ref>, number of models that are better than the same hyperparameter counterpart can be seen in Table <ref type="table" target="#tab_1">2b</ref> for L1 loss and Table <ref type="table" target="#tab_1">2c</ref> for L2 loss. We highlighted in bold values that Wilcoxon signed-rank test with Holm correction found significantly better than the column value.</p><p>It seems that the regularisation does not help. It may be caused by the exaggerated value of the Dropout rate or a need for such models to have wider layers. We do not know the reason why Alpha Dropout performed so badlyit should be better because we used SELU as an activation function.</p><p>One possible explanation for this poor performance is the use of early stopping. In the training phase, the dropout causes the output to be stochastic, so the error is stochastic too. The stochastic error can cause accidental results, which can stop the training prematurely. Because we have one mini-batch, the variance is too significant not to be perceptible.</p><p>Though not tried in our experiments, a possible remedy could be early stopping variation where the error is exponentially smoothed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Size of models</head><p>In the second and third experiments, we were interested in network size and its effect on performance. Figure <ref type="figure" target="#fig_4">3a</ref> and Table <ref type="table">4</ref> show non-regularised models and Figure <ref type="figure" target="#fig_4">3b</ref> and Table <ref type="table" target="#tab_3">5</ref> show the Dropout variants combined together. Non-regularized results are better than the Dropout variants, which are less stable and have delayed response on the increase of network size.</p><p>The stability may come from the same source as the previous problem -the early stopping could make the model undertrained. The delay may be the result of the selected dropout rate. Because we used a dropout rate of 50%, the real amount of usable information can be effectively halved in each hidden layer (given that there is no space or resources to make the information denser). Together it is a 4x delay which is not enough to explain the findings (the optimum size of the model is 2 4 vs 2 7 ). Possible other reasons could be • the difficulty of encoding uncertain patterns • undertraining, due to early stopping</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Loss function</head><p>In the fourth experiment, we analyzed the effect of a loss function for models without regularisation. Trimmed variants performed poorly probably because they remove some residuals (10%) and, therefore, reduce dataset size even more. In our case, Mean Squared Error (MSE) is better fitted than Mean Average Error (MAE). From the distribution in Figure <ref type="figure" target="#fig_5">4</ref> it seems that MSE has much worse results, but the median value (shown as the white point in the central part of the graph) of MSE is better than that of MAE. The best loss function is the Huber loss. All results can be seen in Table <ref type="table" target="#tab_4">6</ref>.</p><p>The Huber loss combines benefits of both worlds because its derivatives are dependent on the size of error (from MSE) while limiting the maximum value (from MAE). This effect may be responsible for the best result among the considered loss functions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Weight regularisation</head><p>The weight regularisation has a prominent effect on the results, as revealed in Figure <ref type="figure" target="#fig_6">5</ref>. Too much is certainly worse than no weight normalisation, but suitable values significantly reduce bad results.           If the regularisation is too high, the loss is effectively replaced only with the term that reduces weights on the network's connections. If it is too low, the network can lack regularisation -creating potentially volatile responses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Activity regularisation</head><p>In our case, the effect of activity regularisation is similar but smaller than the weight penalty. The difference in weight and activity regularisation effectiveness can be explained by the specific activation used in training. The results are in Figure <ref type="figure" target="#fig_7">6</ref> and in Table <ref type="table" target="#tab_6">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Alpha Dropout rate</head><p>In Figure <ref type="figure" target="#fig_8">7</ref> and Table <ref type="table" target="#tab_7">9</ref> the effects of Alpha Dropout rate can be seen. It may be good to investigate smaller values more because the 0.1 rate is the best. The preference for not having this regularisation can be explained equally as in the subsection 4.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we analyzed several types of regularisation techniques on databases where effective hyperparameter optimization is not possible due to the lack of samples or the existence of outliers in the database. We showed that Dropout techniques in these scenarios are not a good choice because their results are not stable enough to compete with models without regularisation. The model's size is an essential aspect, and it seems that the optimum has a far bigger number of free parameters than the theoretical number computed using the average across our training databases. Huber loss function is the best because it does not suffer from inconsistencies of MAE or MSE losses. Trimmed variants of loss functions <ref type="bibr" target="#b21">[21]</ref> performed poorly here, but they may be better if a particular dataset has more samples than we had. The third best hyperparameter to look for is the weight normalization -small weight dramatically reduces the frequency of bad results while keeping the median of results low.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Acknowledgement</head><p>The research reported in this paper has been supported by SVV project number 260 575 and partially supported by the Czech Science Foundation (GA ČR) projects 18-18080S and 19-05704S.</p><p>Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a) Distributions of scaled results by empirical cumulative distribution functions for each dataset and loss separately.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Statistical tests for L2 loss function</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The first experiment compared different kinds of dropout regularizers across all different combinations of datasets and hyperparameters. The tables display the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold. The statistical tests clearly prefer models without dropout.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Distributions of test losses for non-regularized models. (b) Distributions of test losses for Dropout and Alpha Dropout models combined together.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The second and third experiment analyses the effect of model size on regularized and non-regularized models. The regularisation delays over-training but does not improve the results.</figDesc><graphic coords="6,56.69,535.13,481.90,189.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The results of the fourth experiment -comparison between loss functions for non-regularized models. MSE has a lot of good and bad results; Huber seems to be the best; trimmed losses are the worst.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The fifth experiment exposes the effect of L2 normalisation weight.</figDesc><graphic coords="7,56.69,88.94,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: The sixth experiment exposes the effect of L2 activity normalisation weight.</figDesc><graphic coords="7,56.69,318.37,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: The seventh experiment exposes the effect of Alpha Dropout rate (probability).</figDesc><graphic coords="7,56.69,547.80,481.90,182.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>All datasets considered in the analysis.</figDesc><table><row><cell>Name</cell><cell>no. features</cell><cell>no. samples</cell></row><row><cell cols="2">Concrete Compressive Strength</cell><cell>1030</cell></row><row><cell>The Boston Housing</cell><cell></cell><cell>506</cell></row><row><cell>Auto MPG</cell><cell></cell><cell>398</cell></row><row><cell>Proben1 (3d reg.)</cell><cell></cell><cell>4208</cell></row><row><cell>Misra1a</cell><cell></cell><cell>14</cell></row><row><cell>Chwirut2</cell><cell></cell><cell>54</cell></row><row><cell>Chwirut1</cell><cell></cell><cell>214</cell></row><row><cell>Lanczos3</cell><cell></cell><cell>24</cell></row><row><cell>Gauss1</cell><cell></cell><cell>250</cell></row><row><cell>Gauss2</cell><cell></cell><cell>250</cell></row><row><cell>DanWood</cell><cell></cell><cell>6</cell></row><row><cell>Kirby2</cell><cell></cell><cell>151</cell></row><row><cell>Hahn1</cell><cell></cell><cell>236</cell></row><row><cell>Nelson</cell><cell></cell><cell>128</cell></row><row><cell>MGH17</cell><cell></cell><cell>33</cell></row><row><cell>Lanczos1</cell><cell></cell><cell>24</cell></row><row><cell>Lanczos2</cell><cell></cell><cell>24</cell></row><row><cell>Gauss3</cell><cell></cell><cell>250</cell></row><row><cell>Roszman1</cell><cell></cell><cell>25</cell></row><row><cell>ENSO</cell><cell></cell><cell>168</cell></row><row><cell>MGH09</cell><cell></cell><cell>11</cell></row><row><cell>Thurber</cell><cell></cell><cell>37</cell></row><row><cell>BoxBOD</cell><cell></cell><cell>6</cell></row><row><cell>Rat42</cell><cell></cell><cell>9</cell></row><row><cell>MGH10</cell><cell></cell><cell>16</cell></row><row><cell>Eckerle4</cell><cell></cell><cell>35</cell></row><row><cell>Rat43</cell><cell></cell><cell>15</cell></row><row><cell>Bennett5</cell><cell></cell><cell>154</cell></row><row><cell>Hyperparameter</cell><cell cols="2">Considered values</cell></row><row><cell>Loss</cell><cell cols="2">MSE, MAE, Huber, LTS, LTA</cell></row><row><cell cols="3">Size of the first layer 4, 8, 16, 32, 64, 128, 256, 512,</cell></row><row><cell></cell><cell cols="2">1024, 2048, 4096</cell></row><row><cell cols="3">Model regularisation No regularisation,</cell></row><row><cell></cell><cell>50% Dropout,</cell><cell></cell></row><row><cell></cell><cell cols="2">50% Alpha Dropout</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Options for the first set of experiments.was mapped by the corresponding ECDF, creating normalized order of results in a particular bin. Finally, all results are combined back together.To compare normalized results for a specific hyperparameter, we split combined results by the value. These splits create empirical distributions of normalized results, which a violin plot can reasonably visualize.</figDesc><table><row><cell>Hyperparameter</cell><cell>Considered values</cell></row><row><cell>Loss</cell><cell>MSE, MAE, Huber</cell></row><row><cell cols="2">Size of the first layer 4, 8, 16, 32, 64, 128, 256, 512,</cell></row><row><cell></cell><cell>1024, 2048, 4096, 8192</cell></row><row><cell></cell><cell>L2-weight -0.001, 0.01, 0.1,</cell></row><row><cell>Model regularisation</cell><cell>1, 10</cell></row><row><cell></cell><cell>L2-activity -0.001, 0.01, 0.1,</cell></row><row><cell></cell><cell>1, 10</cell></row><row><cell></cell><cell>Alpha dropout -0.1, 0.2, 0.4,</cell></row><row><cell></cell><cell>0.6, 0.8</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Options for the second set of experiments.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5 :</head><label>5</label><figDesc>The third experiment compares L1 loss for models of different sizes with Dropout or Alpha Dropout regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6 :</head><label>6</label><figDesc>The fourth experiment compares L1 losses for models trained by optimization of different loss functions without regularization. The table displays the number of experiments where one model size (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell></cell><cell></cell><cell>Huber</cell><cell>MAE</cell><cell>MSE</cell><cell>LTA</cell><cell>LTS</cell></row><row><cell cols="2">Huber</cell><cell></cell><cell cols="4">1854 1614 2111 2184</cell></row><row><cell cols="3">MAE 1237</cell><cell></cell><cell cols="3">1341 2125 2171</cell></row><row><cell cols="5">MSE 1477 1750</cell><cell cols="2">2006 2085</cell></row><row><cell></cell><cell>LTA</cell><cell>980</cell><cell cols="2">966 1085</cell><cell></cell><cell>1750</cell></row><row><cell></cell><cell>LTS</cell><cell>907</cell><cell cols="3">920 1006 1341</cell></row><row><cell></cell><cell cols="5">0.0 0.001 0.01 0.1</cell><cell>1.0 10.0</cell></row><row><cell>0.0</cell><cell></cell><cell cols="2">4430</cell><cell cols="3">3705 5053 6260 8348</cell></row><row><cell cols="2">0.001 5686</cell><cell></cell><cell></cell><cell cols="3">4364 5735 6987 8967</cell></row><row><cell>0.01</cell><cell>6411</cell><cell cols="2">5752</cell><cell cols="3">7086 8278 9601</cell></row><row><cell>0.1</cell><cell>5063</cell><cell cols="2">4381</cell><cell>3030</cell><cell cols="2">8325 9686</cell></row><row><cell>1.0</cell><cell>3856</cell><cell cols="2">3129</cell><cell cols="2">1838 1790</cell><cell>9449</cell></row><row><cell>10.0</cell><cell>1768</cell><cell cols="2">1149</cell><cell>515</cell><cell>430</cell><cell>647</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7 :</head><label>7</label><figDesc>The fifth experiment exposes the effect of L2 normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell></cell><cell cols="3">0.0 0.001 0.01 0.1</cell><cell>1.0 10.0</cell></row><row><cell>0.0</cell><cell></cell><cell>5197</cell><cell cols="2">4895 4813 5787 7044</cell></row><row><cell cols="2">0.001 4919</cell><cell></cell><cell cols="2">4921 4901 5833 7115</cell></row><row><cell>0.01</cell><cell>5221</cell><cell>5195</cell><cell cols="2">5237 6352 7654</cell></row><row><cell>0.1</cell><cell>5303</cell><cell>5215</cell><cell>4879</cell><cell>7313 8509</cell></row><row><cell>1.0</cell><cell>4329</cell><cell>4283</cell><cell>3764 2803</cell><cell>8631</cell></row><row><cell>10.0</cell><cell>3072</cell><cell>3001</cell><cell cols="2">2462 1607 1485</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8 :</head><label>8</label><figDesc>The sixth experiment exposes the effect of L2 activity normalisation weight. The table displays the number of experiments where one weight of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table><row><cell>0.1</cell><cell>0.2</cell><cell>0.4</cell><cell>0.6</cell><cell>0.8</cell></row><row><cell>0.1</cell><cell cols="4">7226 7654 7614 7275</cell></row><row><cell>0.2 2890</cell><cell></cell><cell cols="3">7075 6813 6335</cell></row><row><cell cols="2">0.4 2462 3041</cell><cell></cell><cell cols="2">6137 5684</cell></row><row><cell cols="3">0.6 2502 3303 3979</cell><cell></cell><cell>5449</cell></row><row><cell cols="4">0.8 2841 3781 4432 4667</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 9 :</head><label>9</label><figDesc>The seventh experiment exposes the effect of Alpha Dropout rate (probability). The table displays the number of experiments where one rate of the regularization (row) is better than the other (column). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value is highlighted in bold.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">Like k-NN or regression tree.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The second experiment compares L1 loss for models of different sizes without regularization. The table displays the number of experiments where one model size (row)</title>
		<idno>1030 1285 1682 2087 8 1640 1089 802 759 849 926 1088 1412 1855 2197 16 1862 1721 982 892 932 1024 1210 1654 2065 2373 32 2073 2008 1828 1268 1255 1247 1490 1967 2335 2542 64 2093 2051 1918 1542 1347 1355 1620 2111 2438 2588 128 2015 1961 1878 1555 1463 1406 1740 2193 2453 2614 256 1940 1884 1786 1563 1455 1404 1792 2280 2514 2623 512 1780 1722 1600 1320 1190 1070 1018 2127 2417 2585 1024 1525 1398 1156 843 699 617 530 683 2083 2411 2048 1128</idno>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="955" to="745" />
		</imprint>
	</monogr>
	<note>). If the one-sided Wilcoxon signed-rank test with Holm correction rejected a null hypothesis (column is better than row), the value</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">What size neural network gives optimal generalization? convergence properties of backpropagation</title>
		<author>
			<persName><forename type="first">Steve</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clyde</forename><forename type="middle">Lee</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ah</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tsoi</forename></persName>
		</author>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
		<respStmt>
			<orgName>Institute for Advanced Computer Studies University of Maryla</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The elements of statistical learning: data mining, inference and prediction</title>
		<author>
			<persName><forename type="first">Trevor</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jerome</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
	<note>2 edition</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Deep Learning</title>
		<author>
			<persName><forename type="first">Ian</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Courville</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Priors for infinite networks</title>
		<author>
			<persName><forename type="first">Radford</forename><forename type="middle">M</forename><surname>Neal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Bayesian Learning for Neural Networks</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="29" to="53" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Computing with infinite networks</title>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<publisher>Morgan Kaufmann Publishers</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="295" to="301" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Steps toward deep kernel methods from infinite neural networks</title>
		<author>
			<persName><forename type="first">Tamir</forename><surname>Hazan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tommi</forename><surname>Jaakkola</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.05133</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Jaehoon</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yasaman</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roman</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><forename type="middle">S</forename><surname>Schoenholz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jascha</forename><surname>Sohl-Dickstein</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.00165</idno>
		<title level="m">Deep Neural Networks as Gaussian Processes</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Occam&apos;s razor</title>
		<author>
			<persName><forename type="first">Anselm</forename><surname>Blumer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrzej</forename><surname>Ehrenfeucht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Haussler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manfred</forename><forename type="middle">K</forename><surname>Warmuth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information processing letters</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="377" to="380" />
			<date type="published" when="1987">1987</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Occam&apos;s razor</title>
		<author>
			<persName><forename type="first">Carl</forename><forename type="middle">Edward</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zoubin</forename><surname>Ghahramani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<publisher>MIT</publisher>
			<date type="published" when="1998">2001. 1998</date>
			<biblScope unit="page" from="294" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions</title>
		<author>
			<persName><forename type="first">Jason</forename><surname>Brownlee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning Mastery</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Self-Normalizing Neural Networks</title>
		<author>
			<persName><forename type="first">Günter</forename><surname>Klambauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Mayr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sepp</forename><surname>Hochreiter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.02515</idno>
		<idno>arXiv: 1706.02515</idno>
		<imprint>
			<date type="published" when="2017-09">September 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</title>
		<author>
			<persName><forename type="first">Nitish</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014-06">June 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond</title>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Scholkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge, MA, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Robust statistics</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Huber</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Wiley</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
	<note>2nd edition</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The influence curve and its role in robust estimation</title>
		<author>
			<persName><forename type="first">Frank</forename><forename type="middle">R</forename><surname>Hampel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the american statistical association</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">346</biblScope>
			<biblScope unit="page" from="383" to="393" />
			<date type="published" when="1974">1974</date>
			<publisher>Taylor &amp; Francis</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Outlier detection</title>
		<author>
			<persName><forename type="first">Irad</forename><surname>Ben-Gal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data mining and knowledge discovery handbook</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="131" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">On-line outlier detection and data cleaning</title>
		<author>
			<persName><forename type="first">Hancong</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sirish</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Chemical Engineering</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1635" to="1647" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Efficient Algorithms for Mining Outliers from Large Data Sets</title>
		<author>
			<persName><forename type="first">Sridhar</forename><surname>Ramaswamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyuseok</forename><surname>Shim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;00</title>
				<meeting>the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;00<address><addrLine>New York, NY, USA; Dallas, Texas, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="427" to="438" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Least median of squares regression</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Rousseeuw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American statistical association</title>
		<imprint>
			<biblScope unit="volume">79</biblScope>
			<biblScope unit="issue">388</biblScope>
			<biblScope unit="page" from="871" to="880" />
			<date type="published" when="1984">1984</date>
			<publisher>Taylor &amp; Francis</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Computing LTS regression for large data sets</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Rousseeuw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katrien</forename><surname>Van Driessen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data mining and knowledge discovery</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="29" to="45" />
			<date type="published" when="2006">2006</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Robust Multilayer Perceptrons: Robust Loss Functions and Their Derivatives</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vidnerová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Iliadis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Angelov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Jayne</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Pimenidis</surname></persName>
		</editor>
		<meeting>the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference<address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Cham</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="546" to="557" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Effective automatic method selection for nonlinear regression modelling</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neoral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vidnerová</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Neural Systems</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Individual Comparisons by Ranking Methods</title>
		<author>
			<persName><forename type="first">Frank</forename><surname>Wilcoxon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics Bulletin</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="80" to="83" />
			<date type="published" when="1945">1945</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A simple sequentially rejective multiple test procedure</title>
		<author>
			<persName><forename type="first">Sture</forename><surname>Holm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scandinavian journal of statistics</title>
		<imprint>
			<biblScope unit="page" from="65" to="70" />
			<date type="published" when="1979">1979</date>
			<publisher>JSTOR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Should We Really Use Post-Hoc Tests Based on Mean-Ranks</title>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Benavoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Corani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesca</forename><surname>Mangili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
