
The Fourth International Workshop on Computer Modeling and Intelligent Systems, April

10.1007/s00354-017-0027-x

Homogeneity hypothesis in discriminant analysis

Dmitriy Klyushin

Taras Shevchenko National University of Kyiv, Ukraine, 03680, Kyiv, Akademika Glushkova Avenue 4D

2020


One of the most important properties of a machine learning algorithm is its ability to generalize the results of learning on finite training sets. This property is based on the compactness hypothesis, which states that objects of the same class in the feature space are, as a rule, located closer to each other than to objects of other classes. The compactness hypothesis has a geometric nature and uses the concept of proximity in the feature space, which is most often expressed in terms of a metric. However, this hypothesis does not fully take into account the probabilistic nature of the features. It is quite suitable for data with unimodal distributions that have compact support, but in the general case it may not hold, leading to incorrect generalizations. This paper describes an alternative approach in which the homogeneity hypothesis is used instead of the compactness hypothesis. Within this approach, objects are called homogeneous if their features follow identical distributions. As measures of homogeneity we propose Petunin's p-statistic and its versions, which are highly efficient in recognizing both disjoint and significantly overlapping samples that violate the compactness assumption. This approach has a rigorous mathematical foundation and high efficiency in practical applications.

Keywords: discriminant analysis, relational analysis, featureless pattern recognition, compactness hypothesis, homogeneity hypothesis

1. Introduction

The complexity of machine learning significantly depends on the compactness hypothesis, which allows generalizations based on finite training samples. Intuitively, the hypothesis states that in the feature space, similar objects should be closer to each other than to dissimilar ones. This definition appeals to the geometric concept of proximity and implicitly uses a metric. A typical example of a method based on such principles is the nearest neighbor method, which recognizes test objects by their closeness to training objects.
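The nearest neighbor rule mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance and a toy two-class training set; the function name `nearest_neighbor_label` and the data are hypothetical and do not come from the paper.

```python
import math

def nearest_neighbor_label(test_point, training_points, training_labels):
    """Return the label of the training point closest to test_point.

    Closeness is measured with the Euclidean metric (an assumption of
    this sketch; any other metric could be substituted).
    """
    best_index = min(
        range(len(training_points)),
        key=lambda i: math.dist(test_point, training_points[i]),
    )
    return training_labels[best_index]

# Toy training set (hypothetical): two compact, well-separated classes.
train = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.1), (2.9, 3.3)]
labels = ["A", "A", "B", "B"]

print(nearest_neighbor_label((0.1, 0.3), train, labels))
```

The rule works here precisely because the two classes are compact in the geometric sense; nothing in it refers to the distribution of the features.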

The compactness hypothesis ignores the probabilistic nature of random training data. More precisely, it is only acceptable for classifying random data with unimodal distributions having compact support. In practice, such a condition is too restrictive. Therefore, it is necessary to develop a method that estimates the proximity between random samples on different principles. To solve the problem, let us introduce the concept of object homogeneity, which means that objects are drawn from the same population, i.e. their features obey the same distribution. This allows objects to be classified using criteria for testing statistical hypotheses of homogeneity.

The aim of the article is to describe a new approach to machine learning based on the homogeneity hypothesis as an alternative to the compactness hypothesis. Using a measure of homogeneity, not just a metric, allows generalizing relational discriminant analysis and increasing its efficiency.

2. Distance‐based machine learning techniques

Duin, Pekalska et al. [1–4], Mottl, Seredin et al. [5–8] and others proposed the concept of featureless discriminant analysis. They suggested replacing the feature vector of an object with an estimate of its proximity to some training set using a metric. Unfortunately, this approach is poorly suited to problems often encountered in biomedical research. Suppose a researcher studies the parameters of a set of cells. The result is a sample of real numbers, not an ordered vector. In such cases, a metric is not applicable and the only useful tool is a homogeneity (similarity) measure. Among the numerous statistical tests for the homogeneity of samples, only the Kolmogorov–Smirnov test, the Wilcoxon test and the Klyushin–Petunin test allow us to assess robustly the homogeneity of samples in the form of the probability of belonging to the same general population [9].

As noted above, in featureless discriminant analysis objects are represented not as vectors in a feature space, but by a measure of proximity to a training set. As a result, the starting point of featureless analysis is a distance matrix filled with distances or labels characterizing the similarity between objects in the training set (reference points). Euclidean and pseudo-Euclidean distances, coupled with the kernel trick, are usually used as the basic distance. Obviously, this approach leads to problems with generalization power and a strong dependence on the training set. In addition, it is not valid for samples consisting of independent identically distributed (i.i.d.) random values.
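The distance-based representation described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and tuple-valued objects; the function name `dissimilarity_representation` is hypothetical and is not taken from the cited works.

```python
import math

def dissimilarity_representation(objects, reference_points):
    """Represent each object by its distances to the reference points.

    Row k of the result replaces the feature vector of objects[k] with
    the vector of Euclidean distances to every reference point.
    """
    return [
        [math.dist(obj, ref) for ref in reference_points]
        for obj in objects
    ]

# Hypothetical example: two objects described relative to two reference points.
matrix = dissimilarity_representation(
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (3.0, 0.0)],
)
print(matrix)
```

The sketch makes the limitation visible: `math.dist` needs ordered vectors of equal length, so it cannot be applied when an object is an unordered sample of i.i.d. values.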

Recently, this approach was revived in machine learning techniques such as the Minimal Learning Machine [10] and the Extreme Minimal Learning Machine [11]. The main tool in these methods is nonlinear distance regression, which estimates the dissimilarity between observations. Nowadays, various metrics and learning techniques are used in this field [12–21]. Excellent surveys of these methods may be found in [22, 23]. These methods have some useful advantages, but they use probability distributions constructed from the Euclidean distance. Thus, they fail in situations where reference points are not vectors in some vector space but samples of i.i.d. random values.

The problem we try to solve is to extend the application field of distance-based machine learning techniques by estimating distances between objects not with metrics but with the homogeneity (similarity) measures described below, and to propose an alternative way of similarity-based classification. These homogeneity measures do not depend on the underlying distributions of the training samples and have useful generalization properties. For example, in contrast to their standard counterparts (the Kolmogorov–Smirnov statistic and the Wilcoxon statistic), they work successfully both with samples following distributions with different means and identical variances and with samples following distributions with identical means and different variances.

3. Two‐sample homogeneity measure

Consider training samples $x = (x_1, x_2, \ldots, x_n) \in G_1$ and $y = (y_1, y_2, \ldots, y_n) \in G_2$ from populations $G_1$ and $G_2$ following absolutely continuous distribution functions $F_1$ and $F_2$. We reduce the classification of a test sample $z = (z_1, z_2, \ldots, z_n)$ to testing the homogeneity of $z$ and $x$ on the one hand, and of $z$ and $y$ on the other. There are many nonparametric tests for two-sample homogeneity: the Kolmogorov–Smirnov test, the Wilcoxon signed-rank test, etc. (see, for example, [24]). However, as will be shown, the most effective tool for testing the homogeneity of two samples is Petunin's p-statistic [25]. The reason is that the p-statistic keeps similarly high significance and sensitivity both when the samples are disjoint and when they almost completely overlap.

3.1. Original Klyushin–Petunin test

The Klyushin–Petunin test [25] is nonparametric and uses only the assumption that the distribution functions are absolutely continuous. This test relies on Hill's assumption A(n) [26], which states that for exchangeable random values $x_1, x_2, \ldots, x_n \in G$ following an absolutely continuous distribution function we have

$$P\left(x \in (x_{(i)}, x_{(j)})\right) = p_{ij} = \frac{j - i}{n + 1}, \quad j > i, \tag{1}$$

where $x_{(i)}$ and $x_{(j)}$ are the $i$-th and $j$-th order statistics. Hill's assumption was proved both for i.i.d. random values [27] and for exchangeable identically distributed random values [28]. Computing the relative frequency $h_{ij}$ of the event $z_m \in (x_{(i)}, x_{(j)})$ over the elements of $z$, we can estimate the proximity between $h_{ij}$ and $p_{ij}$. This can be done using any of the numerous confidence intervals for a binomial proportion. For definiteness, we use the Wilson confidence interval $I_{ij}^{(n)} = \left(p_{ij}^{(1)}, p_{ij}^{(2)}\right)$, where

$$p_{ij}^{(1)} = \frac{h_{ij} n + g^2/2 - g\sqrt{h_{ij}(1 - h_{ij}) n + g^2/4}}{n + g^2}, \qquad p_{ij}^{(2)} = \frac{h_{ij} n + g^2/2 + g\sqrt{h_{ij}(1 - h_{ij}) n + g^2/4}}{n + g^2}. \tag{2}$$

The significance level of this interval is a function of $g$. When $g = 3$, the significance level of $I_{ij}^{(n)}$ does not exceed 0.05 [25]. The p-statistic, which estimates the homogeneity of the samples $x$ and $z$, is defined by the equation

$$h = \frac{2}{n(n - 1)} \, \#\left\{ p_{ij} = \frac{j - i}{n + 1} \in I_{ij}^{(n)} \right\}. \tag{3}$$

When the null hypothesis is true, the events $\left\{ p_{ij} = \frac{j - i}{n + 1} \in I_{ij}^{(n)} \right\}$ generate a generalized Bernoulli scheme [29, 30]. When the alternative hypothesis is true, they generate a modified Bernoulli scheme. When the null hypothesis can be either true or false, they generate the Matveichuk–Petunin scheme [31]. When the null hypothesis holds and $\lim_{n \to \infty} \frac{j - i}{n + 1} \in (0, 1)$ and $\lim_{n \to \infty} \frac{i}{n + 1} \in (0, 1)$, the asymptotic significance level of the sequence of confidence intervals $I_{ij}^{(n)}$ is less than 0.05 [25].

As we see, the p-statistic is an estimate of the probability that the samples are homogeneous. Therefore, using (3), we may formulate the following test: the null hypothesis is accepted if $h$ is greater than 0.95; otherwise it is rejected.
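The construction of the p-statistic can be illustrated with a short sketch. It assumes samples without ties, the Wilson-type limits with $g = 3$, and Hill's probability $(j - i)/(n + 1)$ for the interval between the $i$-th and $j$-th order statistics; the function names `wilson_interval` and `p_statistic` are hypothetical, and the number of trials in the interval is taken to be the size of the test sample.

```python
import math

def wilson_interval(h, trials, g=3.0):
    """Wilson-type confidence limits for a proportion h observed in `trials` trials."""
    center = h * trials + g * g / 2.0
    half_width = g * math.sqrt(h * (1.0 - h) * trials + g * g / 4.0)
    denom = trials + g * g
    return (center - half_width) / denom, (center + half_width) / denom

def p_statistic(x, z, g=3.0):
    """Share of intervals (x_(i), x_(j)) whose Hill probability lies in the Wilson interval."""
    xs = sorted(x)
    n = len(xs)
    trials = len(z)  # assumption of this sketch: one trial per element of z
    hits = 0
    total = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Relative frequency of test values inside the open interval.
            h_ij = sum(1 for v in z if xs[i] < v < xs[j]) / trials
            # Probability of the interval under Hill's assumption
            # (the index difference is the same for 0- and 1-based indices).
            p_ij = (j - i) / (n + 1)
            low, high = wilson_interval(h_ij, trials, g)
            if low <= p_ij <= high:
                hits += 1
            total += 1
    return hits / total

# Hypothetical usage: a sample compared with itself and with a disjoint sample.
x = [0.37 * k for k in range(40)]
far = [100.0 + k for k in range(40)]
print(p_statistic(x, x), p_statistic(x, far))
```

The double loop costs on the order of $n^2$ interval checks, each scanning the test sample, which is unproblematic for samples of 40 values as used in the experiments.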

3.2. Modified Klyushin–Petunin test

In practice, samples usually contain rounded numbers and duplicates (ties). Thus, we must distinguish a hypothetical sample, drawn from a hypothetical population $G'$ of precise measurements, from an empirical sample, drawn from an empirical population $G$ of rounded measurements. Let us introduce an empirical sample $x = (x_1, x_2, \ldots, x_n)$ approximating a hypothetical sample $x' = (x'_1, x'_2, \ldots, x'_n)$, and let $x'_{(1)} \le x'_{(2)} \le \ldots \le x'_{(n)}$ and $x_{(1)} < x_{(2)} < \ldots < x_{(m)}$ be the variational series of the hypothetical and empirical samples, respectively.

If a number is drawn from $G$ independently of $x$, then the probability that it falls into the interval $(x_{(k)}, x_{(k+1)})$ depends on the multiplicities $t_l = t(x_{(l)})$ of the elements $x_{(l)}$. If $x$ does not contain ties, the tie-dependent term vanishes.

Suppose that the hypothetical population $G'$ follows a hypothetical absolutely continuous distribution function $F'$. Then (4) holds. Consider empirical samples $x = (x_1, \ldots, x_n)$ and $z = (z_1, \ldots, z_n)$. For the event $z_k \in (x_{(i)}, x_{(j)})$, whose probability $p_{ij}$ is given by (5), we find the observed relative frequency and construct the Wilson confidence interval $I_{ij} = \left(p_{ij}^{(1)}, p_{ij}^{(2)}\right)$. Let us denote $N = \frac{n(n - 1)}{2}$ and compute the empirical p-statistic

$$h = \frac{1}{N} \, \#\left\{ p_{ij} \in I_{ij} \right\}.$$

Then we can formulate the following test: the null hypothesis is accepted if $h$ (the probability that the samples are homogeneous) is greater than 0.95; otherwise the null hypothesis is rejected.

3.3. Exact Klyushin–Petunin test

As we see, the versions of the Klyushin–Petunin test based on the Wilson confidence interval depend on the parameter $g$, which varies from 1.96 for the normal distribution to 3 for a general unimodal distribution. To avoid this uncertainty, we propose to use the exact confidence interval for the unknown probability $p$ on the basis of the proportion $h$ in the Bernoulli model consisting of $n$ trials [32]. To do this, consider two functions of $p \in [0, 1]$:

$$\varphi(p) = |h - p| \quad \text{and} \quad \psi(p) = \frac{1}{2n} + \lambda \sqrt{\frac{p(1 - p)}{n} + \frac{1}{12 n^2}},$$

where $\lambda > \sqrt{8/3}$ is the parameter of the Vysochansky–Petunin inequality [33]. Denote

$$\chi(p) = \sqrt{\frac{p(1 - p)}{n} + \frac{1}{12 n^2}}, \quad p \in R^1.$$

The graph of $\chi(p)$ is the upper half of the ellipse $E$ with the center $\left(\frac{1}{2}, 0\right)$ passing through the points $\left(\frac{1}{2} \pm \sqrt{\frac{1}{4} + \frac{1}{12n}}, 0\right)$. The graph of $\psi(p)$ is obtained by restricting the graph of $\chi(p)$ to the segment $[0, 1]$, stretching or shrinking it by $\lambda$ and shifting it by $\frac{1}{2n}$.

Therefore, the graph of the function $\psi(p)$, which does not depend on $h$, is an arc of an ellipse passing through the points $(0, \psi(0))$, $\left(\frac{1}{2}, \psi\left(\frac{1}{2}\right)\right)$, $(1, \psi(1))$, such that $\psi(p)$ reaches its maximum at the point $p = \frac{1}{2}$ and is symmetrical with respect to this point.

The lower confidence limit $p_1$ is a root of the quadratic equation

$$\left(1 + \frac{\lambda^2}{n}\right) p^2 - \left(\frac{\lambda^2}{n} - \frac{1}{n} + 2h\right) p + h^2 - \frac{h}{n} + \frac{1}{4n^2}\left(1 - \frac{\lambda^2}{3}\right) = 0. \tag{6}$$

If $h > \psi(0) = \frac{1}{2n} + \frac{\lambda}{n\sqrt{12}}$, then the lower confidence limit $p_1$ is the least root of (6). If $h \le \psi(0)$, then $p_1 = 0$.

Similarly, the upper confidence limit $p_2$ is a root of the quadratic equation

$$\left(1 + \frac{\lambda^2}{n}\right) p^2 - \left(\frac{\lambda^2}{n} + \frac{1}{n} + 2h\right) p + h^2 + \frac{h}{n} + \frac{1}{4n^2}\left(1 - \frac{\lambda^2}{3}\right) = 0. \tag{7}$$

If $1 - h > \psi(1)$, then the upper confidence limit $p_2$ is the largest root of (7). If $1 - h \le \psi(1)$, then $p_2 = 1$.

Remark. Since $p_1 \le h \le p_2$, the proportion of successes always lies in the confidence interval $[p_1, p_2]$.

For the generalized Bernoulli model, similar reasoning gives the following quadratic equation for the lower confidence limit:

$$\left(1 + \frac{\lambda^2 (m + n + 1)}{m(n + 2)}\right) p^2 - \left(\frac{\lambda^2 (m + n + 1)}{m(n + 2)} - \frac{1}{m} + 2h\right) p + h^2 - \frac{h}{m} + \frac{1}{4m^2}\left(1 - \frac{\lambda^2}{3}\right) = 0. \tag{8}$$

If $h > \frac{1}{2m} + \frac{\lambda}{m\sqrt{12}}$, then the lower confidence limit $p_1$ for the generalized Bernoulli model is the least root of (8). If $h \le \frac{1}{2m} + \frac{\lambda}{m\sqrt{12}}$, then $p_1 = 0$.

Similarly, the upper confidence limit $p_2$ for the generalized Bernoulli model is a root of the equation

$$\left(1 + \frac{\lambda^2 (m + n + 1)}{m(n + 2)}\right) p^2 - \left(\frac{\lambda^2 (m + n + 1)}{m(n + 2)} + \frac{1}{m} + 2h\right) p + h^2 + \frac{h}{m} + \frac{1}{4m^2}\left(1 - \frac{\lambda^2}{3}\right) = 0. \tag{9}$$

If $1 - h > \frac{1}{2m} + \frac{\lambda}{m\sqrt{12}}$, then the upper confidence limit $p_2$ is the largest root of (9). If $1 - h \le \frac{1}{2m} + \frac{\lambda}{m\sqrt{12}}$, then $p_2 = 1$. By virtue of the previous results, the significance level of the confidence interval does not exceed $\frac{4}{9\lambda^2}$ (in particular, 0.05 for $\lambda = 3$).

4. Experiments and results

To assess the true positive and true negative rates of the proposed tests, we performed numerical experiments using samples from the normal distribution $N(\mu, \sigma)$ with various degrees of overlap. We considered 100 samples of 40 random numbers having different means and the same variance (location shift), as well as 100 samples of 40 random numbers having the same mean and different variances (scale shift). We calculated the average p-statistic and its lower and upper confidence limits, the average Kolmogorov–Smirnov statistic and its p-value, and the average Wilcoxon statistic and its p-value. To estimate the true positive rate of the Klyushin–Petunin test, we used the relative frequency of the event that the p-statistic is less than 0.95 when the distributions are different. The true positive rate of the Kolmogorov–Smirnov and Wilcoxon signed-rank tests is the relative frequency of the event that the corresponding p-value is less than 0.05 when the distributions are different. The true negative rate of the Klyushin–Petunin test is the relative frequency of the event that the upper confidence limit of the p-statistic is greater than 0.95 when the distributions are identical. The true negative rate of the Kolmogorov–Smirnov and Wilcoxon signed-rank tests is the relative frequency of the event that the p-value is not less than 0.05 when the distributions are identical. Thus, we tested two statistical hypotheses: location shift and scale shift. The null location shift hypothesis means that the mathematical expectations of both distributions are identical. The null scale hypothesis means that the variances of both distributions are identical. The alternative hypothesis, in contrast, asserts that the distribution functions are different. The results are presented in Tables 1–11.
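The experimental protocol above can be mimicked with a small simulation. The sketch below is an illustration only: it uses the two-sample Kolmogorov–Smirnov statistic with the standard asymptotic 5% critical value $1.358\sqrt{(n+m)/(nm)}$ as the decision rule, treats the second parameter of $N(\mu, \sigma)$ as the standard deviation, and fixes a random seed. These choices and the function names are assumptions, not details taken from the paper.

```python
import bisect
import math
import random

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    largest = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        largest = max(largest, abs(fx - fy))
    return largest

def rejection_rate(mu, sigma, trials=100, n=40, seed=0):
    """Share of trials in which N(0, 1) and N(mu, sigma) samples are declared different."""
    rng = random.Random(seed)
    # Asymptotic 5% critical value for two samples of equal size n.
    critical = 1.358 * math.sqrt(2.0 / n)
    rejected = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        if ks_statistic(x, y) > critical:
            rejected += 1
    return rejected / trials

# Hypothetical usage: location shift, scale shift, and identical distributions.
print(rejection_rate(2.0, 1.0), rejection_rate(0.0, 3.0), rejection_rate(0.0, 1.0))
```

For different distributions the returned share estimates the true positive rate; for identical distributions one minus the share estimates the true negative rate. Any other homogeneity test can be substituted for the decision rule inside the loop.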

Table 1

P‐statistics for the location shift hypothesis without ties

| Distribution | N(0,1) | N(1,1) | N(2,1) | N(3,1) | N(4,1) |
|---|---|---|---|---|---|
| N(0,1) | 1.000 | 0.752 | 0.680 | 0.457 | 0.389 |
| N(1,1) | – | 1.000 | 0.846 | 0.584 | 0.424 |
| N(2,1) | – | – | 1.000 | 0.680 | 0.442 |
| N(3,1) | – | – | – | 1.000 | 0.570 |
| N(4,1) | – | – | – | – | 1.000 |

Table 2

Exact p‐statistics for the location shift hypothesis without ties

| Distribution | N(0,1) | N(1,1) | N(2,1) | N(3,1) |
|---|---|---|---|---|
| N(0,1) | 1.000 | 0.646 | 0.459 | 0.376 |
| N(1,1) | – | 1.000 | 0.990 | 0.522 |
| N(2,1) | – | – | 1.000 | 0.859 |
| N(3,1) | – | – | – | 1.000 |
| N(4,1) | – | – | – | – |

Note that the p-statistic decreases monotonically as the location shift increases. As expected, in this case the Kolmogorov–Smirnov and Wilcoxon signed-rank tests work well. However, when the distribution functions largely overlap, the discrepancy between them is not very significant. Moreover, the Wilcoxon signed-rank test poorly recognizes the inversions between largely overlapping samples. These statements are supported by the following results (Tables 5–8).

Table 5

P‐statistics for the scale shift hypothesis without ties

| Distribution | N(0,1) | N(0,2) | N(0,3) | N(0,4) | N(0,5) |
|---|---|---|---|---|---|
| N(0,1) | 1.000 | 0.726 | 0.641 | 0.581 | 0.427 |
| N(0,2) | – | 1.000 | 0.819 | 0.753 | 0.620 |
| N(0,3) | – | – | 1.000 | 0.979 | 0.976 |
| N(0,4) | – | – | – | 1.000 | 0.998 |
| N(0,5) | – | – | – | – | 1.000 |

The Kolmogorov–Smirnov test fails in roughly half of the cases when the samples largely overlap, and the Wilcoxon signed-rank test fails altogether. The Klyushin–Petunin test fails in almost a third of the cases for heavily overlapping samples following the distributions N(0,3), N(0,4) and N(0,5).

To simulate ties, we rounded the samples from the previous experiments to two decimal places. As a result, every sample contained four ties. The results are provided in Tables 9–12. The construction of the Kolmogorov–Smirnov and Wilcoxon signed-rank tests does not depend on ties. Thus, we provide results only for the p-statistic.

Table 9

P‐statistics for the location shift hypothesis with ties

| Distribution | N(0,1) | N(1,1) | N(2,1) | N(3,1) | N(4,1) |
|---|---|---|---|---|---|
| N(0,1) | 1.000 | 0.672 | 0.505 | 0.355 | 0.309 |
| N(1,1) | – | 1.000 | 0.831 | 0.305 | 0.323 |
| N(2,1) | – | – | 1.000 | 0.705 | 0.424 |
| N(3,1) | – | – | – | 1.000 | 0.573 |
| N(4,1) | – | – | – | – | 1.000 |

The p-statistic decreases monotonically as the difference between the means increases. The Klyushin–Petunin test, like the Kolmogorov–Smirnov test, does not distinguish between the distributions N(0,3), N(0,4), and N(0,5). At the same time, it turned out to be effective in cases where the Kolmogorov–Smirnov test and the Wilcoxon signed-rank test do not work. Thus, the p-statistic has an advantage over the Kolmogorov–Smirnov and Wilcoxon signed-rank tests.

5. Conclusions

Correct generalization based on finite training sets depends on correctly chosen underlying hypotheses. Traditional discriminant analysis is based on the compactness hypothesis, which states that objects of one class in the feature space are located closer to each other than to objects from another class. This geometric hypothesis does not work when classifying random samples that differ from feature vectors. For samples, the concept of distance is meaningless. It should be replaced by the concept of homogeneity, meaning that features of objects have the same distribution function. The evaluation of the homogeneity of the samples is provided by the Petunin p-statistics and its variants, which demonstrate high sensitivity and specificity in experiments both when testing the hypothesis of a shift in the mean and in testing the hypothesis of a shift in the scale. The proposed method has a rigorous mathematical justification and high efficiency in practical applications.

6. References

[1] R.P.W. Duin, D. de Ridder, D.N.J. Tax, Experiments with a featureless approach to pattern recognition, Pattern Recognition Letters 18 (1997) 1159–1166. doi:10.1016/S0167-8655(97)00138-4.

[2] R.P.W. Duin, Pekalska, D. de Ridder, Relational discriminant analysis, Pattern Recognition Letters 20 (1999) 1175–1181. doi:10.1016/S0167-8655(99)00085-9.

[3] Pekalska, R.P.W. Duin, On combining dissimilarity representations, in: J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, LNCS, vol. 2096, Springer-Verlag, 2001, pp. 359–368. doi:10.1007/3-540-48219-9_36.

[4] Pekalska, R.P.W. Duin, The Dissimilarity Representation for Pattern Recognition, Foundations and Applications, World Scientific, Singapore, 2005.

[5] Mottl, Dvoenko, Seredin, Kulikowski, I. Muchnik, Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification, Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science 2123 (2001) 322–336. doi:10.1007/3-540-44596-X_26.

[6] Mottl, Seredin, Dvoenko, Kulikowski, I. Muchnik, Featureless pattern recognition in an imaginary Hilbert space, in: Object recognition supported by user interaction for service robots, Quebec City, QC, Canada, 2002, pp. 88–91, vol. 2. doi:10.1109/ICPR.2002.1048244.

[7] Seredin, Mottl, Tatarchuk, Razin, Windridge, Convex support and Relevance Vector Machines for selective multimodal pattern recognition, in: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 2012, pp. 1647–1650.

[8] Mottl, Seredin, Krasotkina, Compactness Hypothesis, Potential Functions, and Rectifying Linear Space in Machine Learning: International Conference Commemorating the 40th Anniversary of Emmanuil Braverman's Decease, Boston, MA, USA, April 28–30, 2017, Invited Talks. doi:10.1007/978-3-319-99492-5_3.

[9] R.I. Andrushkiw, N.V. Boroday, D.A. Klyushin, Y.I. Petunin, Computer-aided cytogenetic method of cancer diagnosis, Nova Publishers, New York, 2007.

[10] Kulis, Metric learning: A survey, Foundations and Trends in Machine Learning 5 (2013) 287–364. doi:10.1561/2200000019.

[11] A. H. de Souza Junior, F. Corona, G. A. Barreto, Y. Miche, A. Lendasse, Minimal Learning Machine: A novel supervised distance-based approach for regression and classification, Neurocomputing 164 (2015) 34–44. doi:10.1016/j.neucom.2014.11.073.

[12] D. P. P. Mesquita, J. P. P. Gomes, A. H. de Souza Junior, Ensemble of efficient minimal learning machines for classification and regression, Neural Processing Letters 46 (2017) 751–766. doi:10.1007/s11063-017-9587-5.

[13] A. N. Maia, M. L. D. Dias, J. P. P. Gomes, A. R. da Rocha Neto, Optimally selected minimal learning machine, in: H. Yin, D. Camacho, P. Novais, A. J. Tallón-Ballesteros (Eds.), Intelligent Data Engineering and Automated Learning - IDEAL, Springer International Publishing, Cham, 2018, pp. 670–678. doi:10.1007/978-3-030-33617-2.