<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>NORMALITY ASSUMPTION IN STATISTICAL DATA ANALYSIS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S.Ya. Shatskikh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L.E. Melkumova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mercury Development Russia</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1638</volume>
      <fpage>763</fpage>
      <lpage>768</lpage>
      <abstract>
<p>The article is devoted to the normality assumption in statistical data analysis. It gives a short historical review of the development of scientific views on the normal law and its applications. It also briefly covers normality tests and analyzes possible consequences of using the normality assumption incorrectly.</p>
      </abstract>
      <kwd-group>
        <kwd>normal law</kwd>
        <kwd>normality assumption</kwd>
        <kwd>normal distribution</kwd>
        <kwd>Gaussian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>=1
 →∞
Lim ℙ { ∑ (  −   )⁄( ∑   2)
≤  } =

 =1
1
2</p>
      <p>1
√2</p>
      <p>−∞</p>
      <p>2
∫  − 2  ,
Bessel).</p>
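      <p>As a minimal numerical illustration (a Python sketch of ours, not part of the original
article, with arbitrary parameter choices), normed sums of independent non-normal summands
quickly approach the standard normal law:</p>
      <preformat>
# Illustrative sketch: normed sums of skewed exponential summands are
# approximately N(0, 1), as the limit theorem above states.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 1000, 2000
x = rng.exponential(size=(trials, n))     # skewed summands with mean 1, variance 1
s = (x.sum(axis=1) - n) / np.sqrt(n)      # centered and normed sums
print(stats.kstest(s, "norm"))            # the KS distance from N(0, 1) is small
</preformat>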
    </sec>
    <sec id="sec-2">
      <title>Karl Pearson distributions</title>
      <p>By the end of the 19th century the normal distribution had lost its exclusive position.
This was the result of many attempts to apply statistical methods to (mainly biological)
research results. The distributions that came up in those studies were often
asymmetrical or could have various other deviations from normality.</p>
      <p>By that time Karl Pearson had suggested a system of 12 continuous distributions (in
addition to the normal distribution) which could be used for smoothing empirical data.
Today discrete analogues of the Pearson-type distributions are also known.</p>
    </sec>
    <sec id="sec-3">
      <title>The Ronald Fisher vs. Egon Pearson polemic</title>
      <p>However, in the beginning of the 20th century the normal distribution regained its
standing thanks to the influential works of Ronald Fisher, who demonstrated that using
the normality assumption one can draw conclusions of wide practical importance.
Nevertheless, after R. Fisher’s book “Statistical Methods for Research Workers”
(1925) was published, Egon Pearson (Karl Pearson’s son) made some critical
remarks on whether it is justified to use the normality assumption in statistical data
analysis. According to E. Pearson, many of the tests in Fisher’s book are based on the
normality assumption for the populations the samples are taken from, but the question
of the accuracy of the tests when the population distributions depart from normality is
never discussed. There is no clear statement that the tests should be used with great
caution in such situations.</p>
      <p>Responding to Pearson’s criticism, Fisher stated his point of view, based on
statistical data obtained in experiments on the selection of agricultural plants.</p>
      <p>Fisher believed that biologists check their methods by using control experiments,
so at that time the normality assumption was tested by practice rather than by theory.
By the time of this discussion some consequences of violating the normality assumption
were already known: errors of this sort have only a slight effect on conclusions about
mean values but can be dangerous for conclusions about variances.</p>
    </sec>
    <sec id="sec-4">
      <title>The last decades</title>
      <p>By the end of the 20th century wide usage of statistical methods in biology, medicine,
sociology and economics led researchers to the conclusion that there is a wide variety
of distributions that can be useful in these sciences. Aside from the normal
distribution, distributions with “heavy” tails and asymmetric distributions took the stage.
This was caused by the fact that for many problems in these sciences the
“mechanism” of the central limit theorem was difficult to establish. Also, in contrast to
the physical sciences, one and the same experiment carried out under the same
conditions can lead to different results.</p>
      <p>For this reason the main source of randomness (aside from measurement errors)
became the influence of various factors that were not taken into account and are
interpreted as random.</p>
      <p>
        This state of affairs made it necessary to develop methods of data analysis that are
robust to random deviations from the given assumptions. There was also a need for
methods that do not use the normality assumption at all, for instance the methods of
nonparametric statistics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        It is worth noting that in recent years non-normal stable distributions have become
widely used in theoretical models of economics, financial mathematics and biology
[
        <xref ref-type="bibr" rid="ref8 ref10">8, 10</xref>
        ].
      </p>
      <p>
        It is also worth noting that the stable non-normal Lévy distribution was successfully
used in the theory of laser cooling (Cohen-Tannoudji, Nobel Prize in Physics 1997).
This theory uses the Lévy-Gnedenko limit theorem on convergence to stable
non-Gaussian distributions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Two quotations</title>
      <sec id="sec-5-1">
        <title>J. Tukey:</title>
        <p>Today we can use the Gaussian shape of distribution in a variety of ways to our
profit. We can:
a) use it freely as a reference standard, as a standard against which to assess the
actual behavior of real data -- doing this by finding and looking at deviations.
b) use it, cautiously but frequently, as a crude approximation to the actual behavior,
both of data itself and of quantities derived from data.</p>
        <p>
          In using the Gaussian shape as such an approximation, we owe it to ourselves to keep
in mind that real data will differ from the Gaussian shape in a variety of ways, so that
treating the Gaussian case is only the beginning. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Henri Poincaré:</title>
        <p>There must be something mysterious in the normal law, since mathematicians think
that this is the law of nature and physicists think this is a mathematical theorem.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Testing sample distributions for normality</title>
      <sec id="sec-6-1">
        <title>Pearson’s chi-squared normality test</title>
        <p>
          Since the Gaussian distribution is continuous and has two unknown parameters
(mean and variance), when using Pearson’s test the sample is usually divided into
N classes and the unknown values of the two parameters are replaced by their
statistical estimates. As a result the limit distribution of the χ<sup>2</sup> statistic will not be
asymptotically equal to the chi-squared distribution with N − 3 degrees of freedom. The
distribution function of the χ<sup>2</sup> statistic will lie lower, which means that the level of
significance will be less than the nominal level. Some authors claim that the
chi-squared test is not a good choice for testing normality (see for example [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]).
        </p>
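        <p>The effect can be seen in a small Monte Carlo experiment; the sketch below (a Python
illustration of ours, with arbitrary parameter choices) estimates the actual level of the
test at a nominal level of 5%:</p>
        <preformat>
# Monte Carlo sketch (illustrative, not from the article): with the mean and
# variance estimated from the raw sample, the chi-squared statistic no longer
# follows the chi-squared law with N - 3 degrees of freedom, so the actual
# level of the test deviates from the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_classes, trials = 200, 10, 2000
chi2_values = []
for _ in range(trials):
    x = rng.normal(size=n)
    mu, sigma = x.mean(), x.std(ddof=1)      # estimates replace true parameters
    # equiprobable classes under the fitted law: map data through its CDF
    cells = np.floor(stats.norm.cdf(x, mu, sigma) * n_classes).astype(int)
    observed = np.bincount(np.clip(cells, 0, n_classes - 1), minlength=n_classes)
    expected = n / n_classes
    chi2_values.append(((observed - expected) ** 2 / expected).sum())

crit = stats.chi2.ppf(0.95, df=n_classes - 3)
print("actual level:", np.mean(np.array(chi2_values) > crit))
</preformat>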
      </sec>
      <sec id="sec-6-2">
        <title>Kolmogorov-Lilliefors test</title>
        <p>Sometimes the Kolmogorov test, the omega-squared test, or the chi-squared test is
used incorrectly to test normality of a sample distribution. The Kolmogorov test is
designed to test the hypothesis that the sample is taken from a population with a known
and completely specified continuous distribution function.</p>
        <p>When testing normality of a distribution one is usually unaware of the exact values of
the mean and the variance. However, it is well known that when the parameters of the
distribution are replaced by their sample estimates, the normality assumption is
accepted more often than it should be.</p>
        <p>
          Besides, in this case, to test normality reliably one needs samples of large size
(several hundred observations). It is difficult to guarantee uniformity of observations
for samples of this size [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Some recommendations for using the statistical tests can be found in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It is often
appropriate to use the Lilliefors version of the Kolmogorov test.
        </p>
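        <p>A short Python sketch of ours illustrating both points: the plain Kolmogorov test
with plugged-in estimates versus the Lilliefors correction (kstest and lilliefors are the
actual scipy and statsmodels functions; the data are simulated for illustration):</p>
        <preformat>
# The plain Kolmogorov test with estimated parameters accepts normality too
# often; the Lilliefors version corrects the critical values for estimation.
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=100)           # heavy-tailed, non-normal sample

ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
lf_stat, lf_p = lilliefors(x, dist="norm")
print(f"plain Kolmogorov p = {ks_p:.3f}, Lilliefors p = {lf_p:.3f}")
</preformat>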
      </sec>
      <sec id="sec-6-3">
        <title>Other normality tests</title>
        <p>
          Starting from the 1930s many different normality tests have been developed. Some
examples are the Cramér-von Mises test, the Kolmogorov-Smirnov test, the
Anderson-Darling test, the Shapiro-Wilk test, the D’Agostino test and others (see [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]).
        </p>
        <p>The Kolmogorov-Lilliefors and the Shapiro-Wilk normality tests are implemented in
the Statistica and R software.</p>
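        <p>Several of these tests are also available in Python’s scipy; a brief sketch of ours
(the function names below are real scipy APIs):</p>
        <preformat>
# Running three common normality tests on a simulated sample.
import numpy as np
from scipy import stats

x = np.random.default_rng(2).normal(loc=1.0, scale=2.0, size=50)
print(stats.shapiro(x))            # Shapiro-Wilk
print(stats.anderson(x, "norm"))   # Anderson-Darling
print(stats.normaltest(x))         # D'Agostino-Pearson
</preformat>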
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Consequences of breaking the normality assumption</title>
      <p>The Student t-statistic and the Fisher F-statistic relate to the case when the
observed values have a normal distribution and the correlation between the observations
is equal to zero. If (as is usually the case) the distribution of the observations is not
normal, then the distributions of the t and F statistics differ from the standard ones,
especially for the F-statistic.</p>
      <sec id="sec-7-1">
        <title>Comparing means of two samples</title>
        <p>The most widely used test for comparing the means of two samples with equal
variances is based on the Student t-statistic. In this case the observations must be
independent and under the null hypothesis must have identical normal distributions.
When the distributions are not normal, the level of significance of the t-test is still
almost accurate for sample sizes greater than 12.</p>
        <p>Yet if the variances of the two samples are different, the Student t-test will not give
exact values for the levels of significance even for normal distributions (the
Behrens-Fisher problem, which today has no exact solution, only approximate ones).</p>
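        <p>A short sketch of ours (in Python, using scipy’s actual ttest_ind API) contrasting
the classical test with the Welch approximation, the common practical treatment of the
Behrens-Fisher setting:</p>
        <preformat>
# Comparing means of two samples with unequal variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=25)
b = rng.normal(0.0, 3.0, size=40)              # same mean, different variance
print(stats.ttest_ind(a, b))                   # classical Student t-test
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's approximate t-test
</preformat>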
      </sec>
      <sec id="sec-7-2">
        <title>Comparing variances of two samples</title>
        <p>The test of equality of variances for two independent normal samples is based on
the Fisher F-statistic. The Fisher test based on the F-statistic is very sensitive to
deviations from normality.</p>
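        <p>This sensitivity can be demonstrated with a small Monte Carlo experiment; the sketch
below (an illustration of ours, with parameter choices of our own) compares the actual
level of the F-test for normal and heavy-tailed samples:</p>
        <preformat>
# Under heavy-tailed t(5) observations the F-test of equal variances rejects
# far more often than the nominal 5% level; for normal data it stays close.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, trials = 30, 5000
for name, draw in [("normal", lambda: rng.standard_normal(n)),
                   ("t(5)", lambda: rng.standard_t(5, size=n))]:
    rejections = 0
    for _ in range(trials):
        x, y = draw(), draw()
        f = x.var(ddof=1) / y.var(ddof=1)
        p = 2 * min(stats.f.cdf(f, n - 1, n - 1), stats.f.sf(f, n - 1, n - 1))
        rejections += p &lt; 0.05
    print(name, rejections / trials)
</preformat>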
      </sec>
      <sec id="sec-7-3">
        <title>Large samples</title>
        <p>For large samples both the law of large numbers and the central limit theorem
“mechanism” work. With the corresponding norming applied, the sample mean of a large
number of observations will be close to the population mean, or will have a distribution
close to normal, even if the observations themselves do not have a normal distribution.
In this situation the sums of a large number of squared observations (Pearson’s
chi-squared statistic), as a rule, have almost chi-squared distributions.
We should keep in mind that the proximity to the normal distribution and to the
chi-squared distribution depends on the sample size and on the observation distributions.
Another example is related to maximum likelihood estimates, which have many useful
properties. However, some of these properties hold only for very large samples.
In real practice the samples are almost never very large.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Distribution of the Pearson’s sample correlation coefficient r</title>
        <p>Let ρ be the correlation coefficient of a couple of random variables X and Y:</p>
        <disp-formula>
          <tex-math>\rho = \frac{\mathbb{E}\{(X - \mathbb{E}\{X\})(Y - \mathbb{E}\{Y\})\}}{\sqrt{\mathbb{D}\{X\} \, \mathbb{D}\{Y\}}},</tex-math>
        </disp-formula>
        <p>and let r be the Pearson sample correlation coefficient for a bivariate sample of
observations (x<sub>1</sub>, y<sub>1</sub>), …, (x<sub>n</sub>, y<sub>n</sub>) of these variables:</p>
        <disp-formula>
          <tex-math>r = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2 \cdot \sum_{k=1}^{n} (y_k - \bar{y})^2}}.</tex-math>
        </disp-formula>
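        <p>The sample coefficient can be computed directly from this definition; the short
sketch below (an illustration of ours, with simulated data) checks it against numpy’s
built-in corrcoef:</p>
        <preformat>
# Computing r from the definition and verifying against np.corrcoef.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(r, np.corrcoef(x, y)[0, 1])    # the two values coincide
</preformat>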
        <p>For random variables X and Y with a bivariate Gaussian distribution, when ρ ≠ 0 the
distribution function and the density of the Pearson correlation coefficient r cannot be
expressed via elementary functions, but they can be represented using the
hypergeometric function. For the case when ρ = 0, representations of the correlation
coefficient density via elementary functions are known.</p>
        <p>When ρ = 0, for large samples (the sample size n → ∞) the Pearson correlation
coefficient r has an asymptotically normal distribution.</p>
        <p>However, the convergence of the r coefficient to the normal distribution is very slow;
it is not recommended to use the normal approximation when n &lt; 500. In this case the
Fisher transformation z = arctanh(r) of the r coefficient can be used. It leads to a new
variable z whose distribution is much closer to normal. Using this distribution it is
possible to find confidence intervals for the ρ coefficient.</p>
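        <p>A minimal sketch of ours, assuming the standard large-sample result that
z = arctanh(r) is approximately normal with variance 1/(n − 3) under bivariate normality
(the helper name fisher_ci is illustrative):</p>
        <preformat>
# Confidence interval for rho via the Fisher transformation.
import numpy as np
from scipy import stats

def fisher_ci(r, n, level=0.95):
    z = np.arctanh(r)                                 # Fisher transformation
    half = stats.norm.ppf((1 + level) / 2) / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)       # back-transform the ends

print(fisher_ci(r=0.42, n=60))
</preformat>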
        <p>
          The study of the sensitivity of the r coefficient to deviations from the normal
distribution cannot be considered complete at this time. One of the reasons is that the
distributions of r for non-normal samples have been worked out in detail only for a
relatively small number of special cases. There are examples where the sensitivity of r
to deviations from normality is high, as well as examples where it is rather
insignificant [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was partially supported by a grant of RFBR (project 16-01-00184 A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Good</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardin</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Common errors in statistics, and how to avoid them</article-title>
          . Wiley,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Johnson</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotz</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balakrishnan</surname>
            <given-names>N.</given-names>
          </string-name>
          <article-title>Continuous univariate distributions</article-title>
          . Vol.
          <volume>2</volume>
          . 2nd ed., Wiley,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kotz</surname>
            <given-names>S</given-names>
          </string-name>
          , et al.
          <source>Encyclopedia of statistical sciences</source>
          ,
          <volume>16</volume>
          volumes. 2nd ed., Wiley,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lehmann</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>On the history and use of some standard statistical models</article-title>
          .
          <source>Probability and Statistics: Essays in Honor of D. Freedman</source>
          . Vol.
          <volume>2</volume>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tukey</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Exploratory data analysis</article-title>
          .
          <source>Addison-Wesley</source>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kobzar</surname>
            <given-names>AI</given-names>
          </string-name>
          .
          <source>Applied mathematical statistics. M.: FM</source>
          ,
          <year>2006</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lagutin</surname>
            <given-names>MB</given-names>
          </string-name>
          .
          <source>Pictorial mathematical statistics. M.: BINOM</source>
          ,
          <year>2007</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Prokhorov YuV (ed.)
          <source>Probability and Mathematical Statistics. Encyclopedia. M.: GRE</source>
          ,
          <year>1999</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tyurin</surname>
            <given-names>YuN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarov</surname>
            <given-names>AA</given-names>
          </string-name>
          .
          <article-title>Statistical data analysis on the computer</article-title>
          . M.: INFRA-M,
          <year>1998</year>
          . [In Russian]
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Shiryaev</surname>
            <given-names>AN</given-names>
          </string-name>
          .
          <article-title>Essentials of stochastic finance: facts, models, theory</article-title>
          . World Scientific,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Cohen-Tannoudji</surname>
            <given-names>C.</given-names>
          </string-name>
          , et al.
          <article-title>Levy statistics and laser cooling</article-title>
          . Cambridge University Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>