<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>NORMALITY ASSUMPTION IN STATISTICAL DATA ANALYSIS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S.Ya. Shatskikh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L.E. Melkumova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mercury Development Russia</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1638</volume>
      <fpage>763</fpage>
      <lpage>768</lpage>
      <abstract>
<p>The article is devoted to the normality assumption in statistical data analysis. It gives a short historical review of the development of scientific views on the normal law and its applications. It also briefly covers normality tests and analyzes possible consequences of using the normality assumption incorrectly.</p>
      </abstract>
      <kwd-group>
        <kwd>normal law</kwd>
        <kwd>normality assumption</kwd>
        <kwd>normal distribution</kwd>
        <kwd>Gaussian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>=1
 →∞
Lim ℙ { ∑ (  −   )⁄( ∑   2)
≤  } =

 =1
1
2</p>
      <p>1
√2</p>
      <p>−∞</p>
      <p>2
∫  − 2  ,
Bessel).</p>
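      <p>As a minimal numerical illustration (a Python sketch of ours, not part of the original
article, with arbitrary parameter choices), normed sums of independent non-normal summands
quickly approach the standard normal law:</p>
      <preformat>
# Illustrative sketch: normed sums of skewed exponential summands are
# approximately N(0, 1), as the limit theorem above states.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 1000, 2000
x = rng.exponential(size=(trials, n))     # skewed summands with mean 1, variance 1
s = (x.sum(axis=1) - n) / np.sqrt(n)      # centered and normed sums
print(stats.kstest(s, "norm"))            # the KS distance from N(0, 1) is small
</preformat>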
    </sec>
    <sec id="sec-2">
      <title>Karl Pearson distributions</title>
      <p>By the end of the 19th century the normal distribution had lost its exclusive position.
This was the result of many attempts to apply statistical methods to (mainly biological)
research results. The distributions that came up in those studies were often
asymmetrical or could have various other deviations from normality.</p>
      <p>By that time Karl Pearson had suggested a system of 12 continuous distributions (in
addition to the normal distribution) which could be used for smoothing empirical data.
Today discrete analogues of the Pearson-type distributions are also known.</p>
    </sec>
    <sec id="sec-3">
      <title>The Ronald Fisher vs. Egon Pearson polemic</title>
      <p>However, in the beginning of the 20th century the normal distribution regained its
standing thanks to the influential works of Ronald Fisher, who demonstrated that using
the normality assumption one can draw conclusions of wide practical importance.
Nevertheless, after R. Fisher’s book “Statistical Methods for Research Workers”
(1925) was published, Egon Pearson (Karl Pearson’s son) made some critical
remarks on whether it is justified to use the normality assumption in statistical data
analysis. According to E. Pearson, many of the tests in Fisher’s book are based on the
normality assumption for the populations the samples are taken from, but the question
of the accuracy of the tests when the population distributions depart from normality is
never discussed. There is no clear statement that the tests should be used with great
caution in such situations.</p>
      <p>Responding to Pearson’s criticism, Fisher stated his point of view, based on
statistical data obtained in experiments on the selection of agricultural plants.</p>
      <p>Fisher believed that biologists check their methods by using control experiments,
so at that time the normality assumption was tested by practice rather than by theory.
By the time of this discussion some consequences of violating the normality assumption
were already known: errors of this sort have only a slight effect on conclusions about
mean values but can be dangerous for conclusions about variances.</p>
    </sec>
    <sec id="sec-4">
      <title>The last decades</title>
      <p>By the end of the 20th century wide usage of statistical methods in biology, medicine,
sociology and economics led researchers to the conclusion that there is a wide variety
of distributions that can be useful in these sciences. Aside from the normal
distribution, distributions with “heavy” tails and asymmetric distributions took the stage.
This was caused by the fact that for many problems in these sciences the
“mechanism” of the central limit theorem was difficult to establish. Also, in contrast to
the physical sciences, one and the same experiment carried out under the same
conditions can lead to different results.</p>
      <p>For this reason the main source of randomness (aside from measurement errors)
became the influence of various factors that were not taken into account and are
interpreted as random.</p>
      <p>
        This state of affairs made it necessary to develop methods of data analysis that are
robust to random deviations from the given assumptions. There was also a need for
methods that do not use the normality assumption at all, for instance the methods of
nonparametric statistics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        It is worth noting that in recent years non-normal stable distributions have become
widely used in theoretical models of economics, financial mathematics and biology
[
        <xref ref-type="bibr" rid="ref8 ref10">8, 10</xref>
        ].
      </p>
      <p>
        It is also worth noting that the stable non-normal Lévy distribution was successfully
used in the theory of laser cooling (Cohen-Tannoudji, Nobel Prize in Physics 1997).
This theory uses the Lévy-Gnedenko limit theorem on convergence to stable
non-Gaussian distributions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Two quotations</title>
      <sec id="sec-5-1">
        <title>J. Tukey:</title>
        <p>Today we can use the Gaussian shape of distribution in a variety of ways to our
profit. We can:
a) use it freely as a reference standard, as a standard against which to assess the
actual behavior of real data -- doing this by finding and looking at deviations.
b) use it, cautiously but frequently, as a crude approximation to the actual behavior,
both of data itself and of quantities derived from data.</p>
        <p>
          In using the Gaussian shape as such an approximation, we owe it to ourselves to keep
in mind that real data will differ from the Gaussian shape in a variety of ways, so that
treating the Gaussian case is only the beginning. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Henri Poincaré:</title>
        <p>There must be something mysterious in the normal law, since mathematicians think
that this is the law of nature and physicists think this is a mathematical theorem.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Testing sample distributions for normality</title>
      <sec id="sec-6-1">
        <title>Pearson’s chi-squared normality test</title>
        <p>
          Since the Gaussian distribution is continuous and has two unknown parameters
(mean and variance), when using Pearson’s test the sample is usually divided into
N classes and the unknown values of the two parameters are replaced by their
statistical estimates. As a result the limit distribution of the χ<sup>2</sup> statistic will not be
asymptotically equal to the chi-squared distribution with N − 3 degrees of freedom. The
distribution function of the χ<sup>2</sup> statistic will lie lower, which means that the level of
significance will be less than the nominal level. Some authors claim that the
chi-squared test is not a good choice for testing normality (see for example [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]).
        </p>
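        <p>The effect can be seen in a small Monte Carlo experiment; the sketch below (a Python
illustration of ours, with arbitrary parameter choices) estimates the actual level of the
test at a nominal level of 5%:</p>
        <preformat>
# Monte Carlo sketch (illustrative, not from the article): with the mean and
# variance estimated from the raw sample, the chi-squared statistic no longer
# follows the chi-squared law with N - 3 degrees of freedom, so the actual
# level of the test deviates from the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_classes, trials = 200, 10, 2000
chi2_values = []
for _ in range(trials):
    x = rng.normal(size=n)
    mu, sigma = x.mean(), x.std(ddof=1)      # estimates replace true parameters
    # equiprobable classes under the fitted law: map data through its CDF
    cells = np.floor(stats.norm.cdf(x, mu, sigma) * n_classes).astype(int)
    observed = np.bincount(np.clip(cells, 0, n_classes - 1), minlength=n_classes)
    expected = n / n_classes
    chi2_values.append(((observed - expected) ** 2 / expected).sum())

crit = stats.chi2.ppf(0.95, df=n_classes - 3)
print("actual level:", np.mean(np.array(chi2_values) > crit))
</preformat>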
      </sec>
      <sec id="sec-6-2">
        <title>Kolmogorov-Lilliefors test</title>
        <p>Sometimes the Kolmogorov test, the omega-squared test, or the chi-squared test is
used incorrectly to test normality of a sample distribution. The Kolmogorov test is
designed to test the hypothesis that the sample is taken from a population with a known
and completely specified continuous distribution function.</p>
        <p>When testing normality of a distribution one is usually unaware of the exact values of
the mean and the variance. However, it is well known that when the parameters of the
distribution are replaced by their sample estimates, the normality assumption is
accepted more often than it should be.</p>
        <p>
          Besides, in this case, to test normality reliably one needs samples of large size
(several hundred observations). It is difficult to guarantee uniformity of observations
for samples of this size [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Some recommendations for using the statistical tests can be found in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It is often
appropriate to use the Lilliefors version of the Kolmogorov test.
        </p>
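        <p>A short Python sketch of ours illustrating both points: the plain Kolmogorov test
with plugged-in estimates versus the Lilliefors correction (kstest and lilliefors are the
actual scipy and statsmodels functions; the data are simulated for illustration):</p>
        <preformat>
# The plain Kolmogorov test with estimated parameters accepts normality too
# often; the Lilliefors version corrects the critical values for estimation.
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=100)           # heavy-tailed, non-normal sample

ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
lf_stat, lf_p = lilliefors(x, dist="norm")
print(f"plain Kolmogorov p = {ks_p:.3f}, Lilliefors p = {lf_p:.3f}")
</preformat>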
      </sec>
      <sec id="sec-6-3">
        <title>Other normality tests</title>
        <p>
          Starting from the 1930s many different normality tests have been developed. Some
examples are the Cramér-von Mises test, the Kolmogorov-Smirnov test, the
Anderson-Darling test, the Shapiro-Wilk test, the D’Agostino test and others (see [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]).
        </p>
        <p>The Kolmogorov-Lilliefors and the Shapiro-Wilk normality tests are implemented in
the Statistica and R software.</p>
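        <p>Several of these tests are also available in Python’s scipy; a brief sketch of ours
(the function names below are real scipy APIs):</p>
        <preformat>
# Running three common normality tests on a simulated sample.
import numpy as np
from scipy import stats

x = np.random.default_rng(2).normal(loc=1.0, scale=2.0, size=50)
print(stats.shapiro(x))            # Shapiro-Wilk
print(stats.anderson(x, "norm"))   # Anderson-Darling
print(stats.normaltest(x))         # D'Agostino-Pearson
</preformat>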
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Consequences of breaking the normality assumption</title>
      <p>The Student t-statistic and the Fisher F-statistic relate to the case when the
observed values have a normal distribution and the correlation between the observations
is equal to zero. If (as is usually the case) the distribution of the observations is not
normal, then the distributions of the t and F statistics differ from the standard ones,
especially for the F-statistic.</p>
      <sec id="sec-7-1">
        <title>Comparing means of two samples</title>
        <p>The most widely used test for comparing the means of two samples with equal
variances is based on the Student t-statistic. In this case the observations must be
independent and under the null hypothesis must have identical normal distributions.
When the distributions are not normal, the level of significance of the t-test is still
almost accurate for sample sizes greater than 12.</p>
        <p>Yet if the variances of the two samples are different, the Student t-test will not give
exact values for the levels of significance even for normal distributions (the
Behrens-Fisher problem, which today has no exact solution, only approximate ones).</p>
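        <p>A short sketch of ours (in Python, using scipy’s actual ttest_ind API) contrasting
the classical test with the Welch approximation, the common practical treatment of the
Behrens-Fisher setting:</p>
        <preformat>
# Comparing means of two samples with unequal variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=25)
b = rng.normal(0.0, 3.0, size=40)              # same mean, different variance
print(stats.ttest_ind(a, b))                   # classical Student t-test
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's approximate t-test
</preformat>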
      </sec>
      <sec id="sec-7-2">
        <title>Comparing variances of two samples</title>
        <p>The test of equality of variances for two independent normal samples is based on
the Fisher F-statistic. The Fisher test based on the F-statistic is very sensitive to
deviations from normality.</p>
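        <p>This sensitivity can be demonstrated with a small Monte Carlo experiment; the sketch
below (an illustration of ours, with parameter choices of our own) compares the actual
level of the F-test for normal and heavy-tailed samples:</p>
        <preformat>
# Under heavy-tailed t(5) observations the F-test of equal variances rejects
# far more often than the nominal 5% level; for normal data it stays close.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, trials = 30, 5000
for name, draw in [("normal", lambda: rng.standard_normal(n)),
                   ("t(5)", lambda: rng.standard_t(5, size=n))]:
    rejections = 0
    for _ in range(trials):
        x, y = draw(), draw()
        f = x.var(ddof=1) / y.var(ddof=1)
        p = 2 * min(stats.f.cdf(f, n - 1, n - 1), stats.f.sf(f, n - 1, n - 1))
        rejections += p &lt; 0.05
    print(name, rejections / trials)
</preformat>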
      </sec>
      <sec id="sec-7-3">
        <title>Large samples</title>
        <p>For large samples both the law of large numbers and the central limit theorem
“mechanism” work. With the corresponding norming applied, the sample mean of a large
number of observations will be close to the population mean, or will have a distribution
close to normal, even if the observations themselves do not have a normal distribution.
In this situation the sums of a large number of squared observations (Pearson’s
chi-squared statistic), as a rule, have almost chi-squared distributions.
We should keep in mind that the proximity to the normal distribution and to the
chi-squared distribution depends on the sample size and on the observation distributions.
Another example is related to maximum likelihood estimates, which have many useful
properties. However, some of these properties hold only for very large samples.
In real practice the samples are almost never very large.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Distribution of the Pearson’s sample correlation coefficient r</title>
        <p>Let ρ be the correlation coefficient of a couple of random variables X and Y:</p>
        <disp-formula>
          <tex-math>\rho = \frac{\mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\})\}}{\sqrt{\mathbb{D}\{X\} \, \mathbb{D}\{Y\}}},</tex-math>
        </disp-formula>
        <p>and let r be the Pearson sample correlation coefficient for a bivariate sample of
observations (x<sub>1</sub>, y<sub>1</sub>), …, (x<sub>n</sub>, y<sub>n</sub>) of these variables:</p>
        <disp-formula>
          <tex-math>r = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2 \cdot \sum_{k=1}^{n} (y_k - \bar{y})^2}}.</tex-math>
        </disp-formula>
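        <p>The sample coefficient can be computed directly from this definition; the short
sketch below (an illustration of ours, with simulated data) checks it against numpy’s
built-in corrcoef:</p>
        <preformat>
# Computing r from the definition and verifying against np.corrcoef.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(r, np.corrcoef(x, y)[0, 1])    # the two values coincide
</preformat>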
        <p>For random variables X and Y with a bivariate Gaussian distribution, when ρ ≠ 0 the
distribution function and the density of the Pearson correlation coefficient r cannot be
expressed via elementary functions, but they can be represented using the
hypergeometric function. For the case when ρ = 0, representations of the correlation
coefficient density via elementary functions are known.</p>
        <p>When ρ = 0, for large samples (the sample size n → ∞) the Pearson correlation
coefficient r has an asymptotically normal distribution.</p>
        <p>However, the convergence of the r coefficient to the normal distribution is very slow;
it is not recommended to use the normal approximation when n &lt; 500. In this case the
Fisher transformation z = arctanh(r) of the r coefficient can be used. It leads to a new
variable z whose distribution is much closer to normal. Using this distribution it is
possible to find confidence intervals for the ρ coefficient.</p>
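        <p>A minimal sketch of ours, assuming the standard large-sample result that
z = arctanh(r) is approximately normal with variance 1/(n − 3) under bivariate normality
(the helper name fisher_ci is illustrative):</p>
        <preformat>
# Confidence interval for rho via the Fisher transformation.
import numpy as np
from scipy import stats

def fisher_ci(r, n, level=0.95):
    z = np.arctanh(r)                                 # Fisher transformation
    half = stats.norm.ppf((1 + level) / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)       # back-transform the ends

print(fisher_ci(r=0.42, n=60))
</preformat>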
        <p>
          The study of the sensitivity of the r coefficient to deviations from the normal
distribution cannot be considered complete at this time. One of the reasons is that the
distributions of r for non-normal samples have been worked out in detail only for a
relatively small number of special cases. There are examples where the sensitivity of r
to deviations from normality is high, as well as examples where it is rather
insignificant [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was partially supported by a grant of RFBR (project 16-01-00184 A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Good</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardin</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Common errors in statistics, and how to avoid them</article-title>
          . Wiley,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Johnson</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotz</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balakrishnan</surname>
            <given-names>N.</given-names>
          </string-name>
          <article-title>Continuous univariate distributions</article-title>
          . Vol.
          <volume>2</volume>
          . 2nd ed., Wiley,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kotz</surname>
            <given-names>S</given-names>
          </string-name>
          , et al.
          <source>Encyclopedia of statistical sciences</source>
          ,
          <volume>16</volume>
          volumes. 2nd ed., Wiley,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lehmann</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>On the history and use of some standard statistical models</article-title>
          .
          <source>Probability and Statistics: Essays in Honor of D. Freedman</source>
          . Vol.
          <volume>2</volume>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tukey</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Exploratory data analysis</article-title>
          .
          <source>Addison-Wesley</source>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kobzar</surname>
            <given-names>AI</given-names>
          </string-name>
          .
          <source>Applied mathematical statistics. M.: FM</source>
          ,
          <year>2006</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lagutin</surname>
            <given-names>MB</given-names>
          </string-name>
          .
          <source>Pictorial mathematical statistics. M.: BINOM</source>
          ,
          <year>2007</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Prokhorov YuV (ed.)
          <source>Probability and Mathematical Statistics. Encyclopedia. M.: GRE</source>
          ,
          <year>1999</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tyurin</surname>
            <given-names>YuN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarov</surname>
            <given-names>AA</given-names>
          </string-name>
          .
          <article-title>Statistical data analysis on the computer</article-title>
          . M.: INFRA-M,
          <year>1998</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Shiryaev</surname>
            <given-names>AN</given-names>
          </string-name>
          .
          <article-title>Essentials of stochastic finance: facts, models, theory</article-title>
          . World Scientific,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cohen-Tannoudji</surname>
            <given-names>C.</given-names>
          </string-name>
          , et al.
          <article-title>Levy statistics and laser cooling</article-title>
          . Cambridge University Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>