Ellipsoidal distribution-free set

Dmitriy Klyushin 1,* and Andrii Tymoshenko 1,*

1 Taras Shevchenko National University of Kyiv, 60 Volodymyrska Street, 01033, Kyiv, Ukraine

Abstract
This paper introduces a distribution-free approach based on Hill's assumption and Petunin ellipsoids. Several distributions are used to generate points and build ellipsoids, and test points drawn from the same distribution are then checked for membership in the largest ellipsoid. As a result, a new prediction set is constructed in the form of a Petunin ellipsoid, whose confidence level is determined by the number of points. The method described here works effectively for the chosen distributions. Moreover, a statistical analysis of the number of covered points is performed. This method is a useful tool for solving many urgent problems of machine learning, e.g. generalization of training samples, effective cross-validation, etc.

Keywords
data mining, prediction set, Petunin ellipsoid, outlier detection

Information Technology and Implementation (IT&I-2024), November 20-21, 2024, Kyiv, Ukraine
* Corresponding author. These authors contributed equally.
dmytroklyushin@knu.ua (D. Klyushin); andriitymoshenko@knu.ua (A. Tymoshenko)
0000-0003-4554-1049 (D. Klyushin); 0000-0002-8884-9054 (A. Tymoshenko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Construction of prediction sets is a popular problem in machine learning, often studied together with neural networks, and in recent years many researchers have addressed it. For example, Adam Khakhar, Stephen Mell and Osbert Bastani [1] used a trained code generation model in an algorithm that leverages the abstract syntax tree of the programming language to create a set of programs that contains the correct program with high confidence. As another example, Soroush H. Zargarbashi, Mohammad Sadegh Akhondzadeh and Aleksandar Bojchevski [2] derive provably robust prediction sets by bounding the worst-case change of the conformity scores.

Another important direction is distribution-free classification and prediction sets. Here we can mention the work of Chirag Gupta, Aleksandr Podkopaev and Aaditya Ramdas [3], who study calibration and prediction sets combined with confidence intervals. Their research addresses binary classification with distribution-free sets; based on the demonstrated theorems, confidence intervals for binned probabilities allow distribution-free calibration. In [4], Hongxiang Qiu, Edgar Dobriban and Eric Tchetgen Tchetgen offer PredSet-1Step, a novel flexible distribution-free method for constructing prediction sets whose asymptotic coverage is guaranteed under unknown covariate shift. In [5], A. Angelopoulos, S. Bates, J. Malik and M. I. Jordan present an algorithm that modifies a chosen classifier to output a predictive set containing the true label with a user-specified probability; this simple and fast algorithm is reminiscent of Platt scaling but yields a formal finite-sample coverage guarantee for every model and dataset. The approach described in [6] chooses the size of the prediction sets by analyzing a holdout set and leads to explicit finite-sample guarantees. As a result, simple, distribution-free and rigorous error control is obtained for many tasks, demonstrated on five large-scale machine learning problems.
Further works related to prediction offer various approaches: prediction based on language models [7], neural networks compared with calibrated predictions [8], distribution-free uncertainty quantification and conformal prediction [9], conformal risk control [10], conformal predictors applied to medical imaging [11], confident prediction under distribution shift [12], conformal prediction robust to label noise [13], and conformal prediction via probabilistic circuits [14]. Related topics are also worth mentioning: uncertainty quantification over graphs using conformalized graph neural networks [15], adversarial robustness of randomly smoothed classifiers [16], randomized smoothing for graphs and images [17], and adversarially trained smoothed classifiers [18].

The purpose of our paper is to describe a method for constructing an ellipsoidal prediction set for randomly generated points drawn from a chosen distribution. The main tools of our forecast are predictive sets represented by ellipses constructed from the generated points. Test points are generated from the same distribution; the more ellipses contain a point, the higher the expected probability that it belongs to the same class.

Consider the problem of conformal prediction based on points $x_1, x_2, \dots, x_m \in \mathbb{R}^d$. The aim is to find a prediction set $E(x_1, x_2, \dots, x_m) \subset \mathbb{R}^d$ such that $p(x_{m+1} \in E) \ge 1 - \alpha$, where $0 < \alpha < 1$ is a chosen significance level, so that $p(x_{m+1} \in E)$ is the confidence level of the predictive set.

2. Hill's Assumption A(m)

Let $x_1, x_2, \dots, x_m$ denote a sample drawn from a population with an absolutely continuous distribution $F$. Arranging it in increasing order, we obtain the variational series $x_{(1)} \le x_{(2)} \le \dots \le x_{(m)}$, where $x_{(i)}$ is the $i$-th order statistic. The resulting order statistics $x_{(1)}, x_{(2)}, \dots, x_{(m)}$ are dependent. The distribution $F_k(u)$ of the $k$-th order statistic $x_{(k)}$ can be calculated as

$$F_k(u) = \sum_{i=k}^{m} C_m^i \left(F(u)\right)^i \left(1 - F(u)\right)^{m-i},$$

where $F(u) = p(x < u)$.

Hill's assumption $A(m)$ [19] states that if the sample is drawn from the population according to the distribution $F$, then

$$p\left(x_{m+1} \in \left(x_{(i)}, x_{(j)}\right)\right) = \frac{j - i}{m + 1}, \quad j > i. \qquad (1)$$

$A(m)$ was proven in the papers of Yu. I. Petunin [20] and by several other scientists.
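Property (1) is easy to check empirically before turning to the proof. The following minimal Python sketch (our illustration, not part of the original derivation; the sample size, the indices and the choice of the standard normal distribution are arbitrary) estimates the coverage probability by Monte Carlo and compares it with $(j - i)/(m + 1)$:

```python
import numpy as np

def hill_coverage(m=20, i=3, j=15, trials=100_000, seed=0):
    """Monte Carlo estimate of p(x_{m+1} in (x_(i), x_(j))) for an i.i.d. sample.

    By Hill's assumption A(m), the result should be close to (j - i)/(m + 1)
    for any absolutely continuous distribution; N(0, 1) is used here.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x = np.sort(rng.standard_normal(m))  # order statistics x_(1), ..., x_(m)
        x_new = rng.standard_normal()        # independent (m + 1)-th observation
        hits += x[i - 1] < x_new < x[j - 1]  # i, j are 1-based indices, i < j
    return hits / trials

print(hill_coverage())        # empirical coverage, approx. 0.571
print((15 - 3) / (20 + 1))    # theoretical value (j - i)/(m + 1) = 12/21
```

Replacing the normal generator with any other continuous distribution leaves the estimate unchanged up to Monte Carlo error, which is exactly the distribution-free nature of $A(m)$.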
Let us recall the proof for random variables $\xi$ and $\eta$. If they are independent, then

$$p(\xi < \eta) = \int_{-\infty}^{\infty} F_{\xi}(u)\, dF_{\eta}(u), \qquad (2)$$

where $F_{\xi}(u)$ and $F_{\eta}(u)$ denote the distribution functions of $\xi$ and $\eta$, respectively.

Writing $G_i(u) = C_m^i \left(F(u)\right)^i \left(1 - F(u)\right)^{m-i}$, so that $F_k(u) = \sum_{i=k}^{m} G_i(u)$, the probability density of the $k$-th order statistic is

$$f_k(u) = \sum_{i=k}^{m} G_i'(u). \qquad (3)$$

Here

$$G_k'(u) = C_m^k \left[ k \left(F(u)\right)^{k-1} \left(1 - F(u)\right)^{m-k} f(u) - (m-k) \left(F(u)\right)^{k} \left(1 - F(u)\right)^{m-k-1} f(u) \right], \qquad (4)$$

$$G_{k+1}'(u) = C_m^{k+1} \left[ (k+1) \left(F(u)\right)^{k} \left(1 - F(u)\right)^{m-k-1} f(u) - (m-k-1) \left(F(u)\right)^{k+1} \left(1 - F(u)\right)^{m-k-2} f(u) \right].$$

The second term of $G_k'$ is compensated by the first term of $G_{k+1}'$:

$$-C_m^k (m-k) + C_m^{k+1} (k+1) = -\frac{m!\,(m-k)}{(m-k)!\,k!} + \frac{m!\,(k+1)}{(m-k-1)!\,(k+1)!} = -\frac{m!}{(m-k-1)!\,k!} + \frac{m!}{(m-k-1)!\,k!} = 0.$$

The last term of the sum (for $i = m$) is equal to zero, since $\left(F(u)\right)^m \left(1 - F(u)\right)^0 (m - m) f(u) = 0$. Thus the sum telescopes to

$$f_k(u) = m C_{m-1}^{k-1} \left(F(u)\right)^{k-1} \left(1 - F(u)\right)^{m-k} f(u).$$

Let us find $p(x < x_{(i)})$ and $p(x < x_{(j)})$. Using the above equations, we have

$$p\left(x < x_{(i)}\right) = \int_{-\infty}^{\infty} F(u)\, dF_i(u) = m C_{m-1}^{i-1} \int_{-\infty}^{\infty} F(u) \left(F(u)\right)^{i-1} \left(1 - F(u)\right)^{m-i} dF(u) = m C_{m-1}^{i-1} \int_0^1 v^{i} (1 - v)^{m-i}\, dv.$$

It is known that

$$\int_0^1 x^{p-1} (1 - x)^{q-1}\, dx = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)} = \frac{(p-1)!\,(q-1)!}{(p+q-1)!}$$

for integer $p$ and $q$. Applying this with $p = i + 1$ and $q = m - i + 1$, we get

$$\int_0^1 x^{i+1-1} (1 - x)^{m-i+1-1}\, dx = B(i+1, m-i+1) = \frac{\Gamma(i+1)\,\Gamma(m-i+1)}{\Gamma(m+2)} = \frac{i!\,(m-i)!}{(m+1)!},$$

hence

$$p\left(x_{m+1} < x_{(j)}\right) = m C_{m-1}^{j-1} \frac{j!\,(m-j)!}{(m+1)!} = \frac{m\,(m-1)!}{(j-1)!\,(m-j)!} \cdot \frac{j!\,(m-j)!}{(m+1)!} = \frac{j}{m+1},$$

where the last step uses $j!/(j-1)! = j$ and $m\,(m-1)! = m!$. So we get

$$p\left(x_{m+1} \in \left(x_{(i)}, x_{(j)}\right)\right) = p\left(x_{m+1} < x_{(j)}\right) - p\left(x_{m+1} < x_{(i)}\right) = \frac{j}{m+1} - \frac{i}{m+1} = \frac{j-i}{m+1}.$$

Thus, if a random variable $x$ is independent of $x_1, x_2, \dots, x_m$ and is sampled from the same population with distribution $F(u)$, then

$$p\left(x \in \left(x_{(1)}, x_{(m)}\right)\right) = \frac{m-1}{m+1}.$$

Remark 1. The confidence level of the tolerance interval $\left(x_{(1)}, x_{(m)}\right)$ is $\frac{m-1}{m+1}$; thus for $m \ge 39$ the significance level $\frac{2}{m+1}$ of this interval does not exceed 0.05.

3. Petunin Ellipsoids

The algorithm for constructing an ellipsoid that contains a set of $m$ random points with probability $\frac{m-1}{m+1}$ was proposed by Yu. I. Petunin. Statistical and geometrical properties of Petunin ellipsoids were investigated in [21]. Here we describe the two-dimensional case; a code sketch of the construction is given at the end of this section.

First, we find the two points $x_k$ and $x_l$ of the set $M_m = \{x_1, \dots, x_m\}$ that are farthest from each other and connect them with a line segment (referred to below as the diameter). To simplify the construction, we can rotate all points around the center of this segment so that the diameter becomes horizontal, rotating everything back at the end (Figure 1).

Figure 1: The two points farthest from each other

Next, we find the points farthest from the diameter and construct lines through them parallel to the diameter, together with lines orthogonal to the diameter passing through its endpoints. As a result, we obtain a rectangle that covers the given set of points and lies in the two-dimensional plane (Figure 2).

Figure 2: Rectangle covering the images of the points

Dividing the shorter side length by the longer one gives the shrinking coefficient. Translating, rotating and shrinking the rectangle as described above produces a square covering the images of the given points (Figure 3).

Figure 3: Square covering the images of the points

Find the center of the square and the distances from it to every image of a point. Then take the maximal of these distances and create a circle whose center coincides with the center of the square and whose radius equals this maximal distance (Figure 4).

Figure 4: The circle covering the images of the points

Performing the inverse transformations of this circle yields the Petunin ellipse (Figure 5).

Figure 5: Petunin ellipse covering the initial points

In the high-dimensional case, we construct a minimum-volume axis-aligned orthogonal parallelepiped containing the images of the initial points, perform a shrinking transformation from the parallelepiped to a hypercube, find the center of the hypercube and the distances from it to the images, and take the maximum distance. After that, we construct a hypersphere centered at the hypercube center with radius equal to this maximum distance. Performing the inverse transformations (translation, rotation and stretching), we obtain the Petunin ellipsoid $E_m$. Since Hill's assumption $A(m)$ holds, $P(x_{m+1} \in E_m) = \frac{m-1}{m+1}$.
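To make the construction concrete, here is a minimal two-dimensional sketch in Python/NumPy (our illustration; the function names, the brute-force diameter search and the packaging of the transform parameters are our own simplifications, and degenerate samples with a zero-width bounding rectangle are not handled):

```python
import numpy as np

def petunin_ellipse(points):
    """Sketch of the 2D Petunin construction: returns the rotation, shift and
    scaling that normalize the data, plus the covering-circle radius and the
    per-point distances to the square's center."""
    pts = np.asarray(points, dtype=float)
    # Step 1: farthest pair of points (the diameter); brute force, O(m^2).
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    k, l = np.unravel_index(np.argmax(d), d.shape)
    # Step 2: rotate all points so the diameter becomes horizontal.
    dx, dy = pts[l] - pts[k]
    a = np.arctan2(dy, dx)
    R = np.array([[np.cos(a), np.sin(a)], [-np.sin(a), np.cos(a)]])  # rotation by -a
    q = pts @ R.T
    # Step 3: axis-aligned bounding rectangle of the rotated points.
    lo, hi = q.min(axis=0), q.max(axis=0)
    side = hi - lo
    # Step 4: shrink the longer side so the rectangle becomes a square.
    scale = side.min() / side            # 1 on the short axis, < 1 on the long one
    q = (q - lo) * scale
    # Step 5: covering circle centered at the center of the square.
    center = np.full(2, 0.5 * side.min())
    r = np.linalg.norm(q - center, axis=1)
    return R, lo, scale, center, r.max(), r

def contains(point, R, lo, scale, center, radius):
    """Check whether a new point falls inside the resulting Petunin ellipse."""
    q = (np.asarray(point, dtype=float) @ R.T - lo) * scale
    return np.linalg.norm(q - center) <= radius
```

The array of distances returned by petunin_ellipse is exactly what the next paragraph calls statistical depth: the point with the smallest distance is the deepest one, and the point attaining the maximal distance lies on the boundary of the ellipse.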
Since at the last stage of the construction we obtain concentric spheres, each with one unique point on its surface, the Petunin ellipsoids allow us to arrange the points by their statistical depth. The median point of the set (the most probable point) is the point nearest to the center of the Petunin ellipsoid, and the outlier is the point on the boundary of the Petunin ellipsoid.

4. Numerical results

This section describes the testing results. First, we generate a set of 1000 points from a chosen distribution and build the ellipses through each point. Then we generate 1000 more points from the same distribution and check how many of them fall inside the largest ellipse. Statistical characteristics of these results are shown below for three different distributions; a sketch of the experiment is given at the end of this section.

4.1. Normal distribution

The first test was performed on approximately normal points with a 3 to 1 ratio between the horizontal and vertical scales. We generate 12 random numbers from 0 to 1, calculate their sum and subtract 6; the values are then multiplied and shifted so that the horizontal and vertical coordinates are generated in the proportion 3 to 1.

Expected probability    0.99762
Mean                    0.99762
Mean deviation          0.001873
Mode                    0.996477
Median                  0.998
Standard deviation      0.0022867
Variance                0.0000052279
Kurtosis                3.24629
Skewness                -0.892779
Range                   0.01
Maximum                 1
Minimum                 0.989
Geometric mean          0.9976174
Harmonic mean           0.9976148

Figure 6: Average ellipse areas (normal distribution, 100 tests)

As we can see, the largest ellipse contains almost all points, and the areas grow slowly until the last 50 points defining the largest ellipses.

4.2. Exponential distribution

Exponential points were generated by the inverse transform method, with the parameters -17 and -50 used as multipliers of the logarithm of a random value from 0 to 1 for the horizontal and vertical coordinates, respectively.

Expected probability    0.998158
Mean                    0.998158
Mean deviation          0.00138457
Mode                    0.99943
Median                  0.999
Standard deviation      0.0018747
Variance                0.00000351465
Kurtosis                5.98476568
Skewness                -1.5228786
Range                   0.01
Maximum                 1
Minimum                 0.989999
Geometric mean          0.998
Harmonic mean           0.998

Figure 7: Average ellipse areas (exponential distribution, 100 tests)

Here the largest ellipse again contains almost all points, but the ellipse areas grow much faster.

4.3. Gamma distribution

Gamma-distributed pseudorandom numbers were used with parameters 50 and 90 for the horizontal and vertical coordinates, respectively.

Expected probability    0.99776
Mean                    0.99776
Mean deviation          0.0017616
Mode                    0.998438
Median                  0.998
Standard deviation      0.002399
Variance                0.0000057599
Kurtosis                5.95855
Skewness                -1.65869
Range                   0.012
Maximum                 1
Minimum                 0.987999
Geometric mean          0.997757
Harmonic mean           0.997754

Figure 8: Average ellipse areas (gamma distribution, 100 tests)

In this test the ellipse contains almost all points, and the ellipse areas increase very fast after the ellipse based on 800 points.

More tests were performed for these distributions with other parameters, and the results were alike: ellipse areas increased smoothly at first, but a faster increase was observed for the last 100-200 most distant points. As for accuracy, we expected values of approximately 0.998 and obtained matching results.
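The coverage part of the experiment can be reproduced with a short script. The sketch below is our illustration: it reuses the hypothetical petunin_ellipse and contains helpers from the sketch in Section 3, mimics the sum-of-12-uniforms generator of Section 4.1, and builds the single ellipse on all 1000 training points (i.e. the largest one), measuring the fraction of 1000 fresh points it covers for comparison with $(m-1)/(m+1) = 999/1001 \approx 0.998$:

```python
import numpy as np

rng = np.random.default_rng(42)

def normal_points(n, sx=3.0, sy=1.0):
    """Approximate N(0, 1) coordinates via the sum of 12 uniforms minus 6,
    scaled 3:1 horizontally vs. vertically, as in Section 4.1."""
    z = rng.random((n, 2, 12)).sum(axis=-1) - 6.0
    return z * np.array([sx, sy])

train = normal_points(1000)
test = normal_points(1000)

# Largest Petunin ellipse, built on the whole training sample
# (petunin_ellipse and contains are the helpers sketched in Section 3).
R, lo, scale, center, radius, _ = petunin_ellipse(train)
covered = np.mean([contains(p, R, lo, scale, center, radius) for p in test])
print(f"empirical coverage: {covered:.4f}")  # theory: 999/1001 = 0.998
```

Repeating this over 100 independent runs and collecting the covered fractions yields statistics directly comparable to the tables above.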
The confidence level reached is theoretically precise for the tested distributions. The construction also allows us to compute the statistical depth of every point and to detect outliers of the set. The experimental results confirmed the theoretical properties of the Petunin ellipses.

Declaration on Generative AI

The authors have not employed any Generative AI tools.

References

[1] A. Khakhar, S. Mell, O. Bastani, PAC prediction sets for large language models of code, 2023. doi:10.48550/arXiv.2302.08703.
[2] S. H. Zargarbashi, M. S. Akhondzadeh, A. Bojchevski, Robust yet efficient conformal prediction sets, 2024. doi:10.48550/arXiv.2407.09165.
[3] C. Gupta, A. Podkopaev, A. Ramdas, Distribution-free binary classification: prediction sets, confidence intervals and calibration, 2020. URL: https://proceedings.neurips.cc/paper/2020/file/26d88423fc6da243ffddf161ca712757-Paper.pdf.
[4] H. Qiu, E. Dobriban, E. Tchetgen Tchetgen, Prediction sets adaptive to unknown covariate shift, Journal of the Royal Statistical Society Series B: Statistical Methodology 85(5) (2023) 1680-1705. doi:10.1093/jrsssb/qkad069.
[5] A. Angelopoulos, S. Bates, J. Malik, M. I. Jordan, Uncertainty sets for image classifiers using conformal prediction, 2020. URL: https://arxiv.org/abs/2009.14193.
[6] S. Bates, A. Angelopoulos, L. Lei, J. Malik, M. I. Jordan, Distribution-free, risk-controlling prediction sets, 2021. URL: https://arxiv.org/abs/2101.02703.
[7] T. Liu, Y. Jiang, N. Monath, R. Cotterell, M. Sachan, Autoregressive structured prediction with language models, 2022. URL: https://arxiv.org/abs/2210.14698.
[8] S. Park, O. Bastani, N. Matni, I. Lee, PAC confidence sets for deep neural networks via calibrated prediction, 2020. URL: https://arxiv.org/abs/2001.00106.
[9] A. N. Angelopoulos, S. Bates, A gentle introduction to conformal prediction and distribution-free uncertainty quantification, 2021. URL: https://arxiv.org/abs/2107.07511.
[10] A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, T. Schuster, Conformal risk control, 2022. URL: https://api.semanticscholar.org/CorpusID:251320513.
[11] C. Lu, A. Lemay, K. Chang, K. Hobel, J. Kalpathy-Cramer, Fair conformal predictors for applications in medical imaging, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 12008-12016.
[12] M. Cauchois, S. Gupta, A. Ali, J. C. Duchi, Robust validation: confident predictions even when distributions shift, 2020. arXiv preprint arXiv:2008.04267.
[13] B.-S. Einbinder, S. Bates, A. N. Angelopoulos, A. Gendler, Y. Romano, Conformal prediction is robust to label noise, 2022. URL: https://api.semanticscholar.org/CorpusID:262091979.
[14] M. Kang, N. M. Gurel, L. Li, B. Li, COLEP: Certifiably robust learning-reasoning conformal prediction via probabilistic circuits, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=XN6ZPINdSg.
[15] K. Huang, Y. Jin, E. Candes, J. Leskovec, Uncertainty quantification over graph with conformalized graph neural networks, 2023. URL: https://arxiv.org/pdf/2305.14535.
[16] G.-H. Lee, Y. Yuan, S. Chang, T. Jaakkola, Tight certificates of adversarial robustness for randomly smoothed classifiers, in: Advances in Neural Information Processing Systems, volume 32, 2019.
[17] A. Bojchevski, J. Gasteiger, S. Günnemann, Efficient robustness certificates for discrete data: sparsity-aware randomized smoothing for graphs, images and more, in: International Conference on Machine Learning, PMLR, 2020, pp. 1003-1013.
[18] H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, G. Yang, Provably robust deep learning via adversarially trained smoothed classifiers, in: Advances in Neural Information Processing Systems, volume 32, 2019.
[19] B. M. Hill, Posterior distribution of percentiles: Bayes' theorem for sampling from a population, Journal of the American Statistical Association 63 (1968) 677-691.
[20] I. Madreimov, Yu. I. Petunin, Characterization of the uniform distribution using order statistics, Theory of Probability and Mathematical Statistics 27 (1983) 105-110.
[21] S. I. Lyashko, B. V. Rublev, Minimal ellipsoids and maximal simplexes in 3D Euclidean space, Cybernetics and Systems Analysis 39 (2003) 831-834. doi:10.1023/B:CASA.0000020224.83374.d7.