On the statistics of anomalous clumps in random
point images
Aleksandr L. Reznik, Aleksandr A. Soloviev and Andrey V. Torgov
Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences, Novosibirsk,
Russia


Abstract
New algorithms for calculating exact analytical formulas for two related probabilities are proposed,
substantiated, and implemented in software: 1) the probability that anomalously large local
groups form in a random point image; 2) the probability that no significant local groupings
occur in a random point image.

Keywords
Random point image, computer analysis, local groupings.


1. Introduction
In many scientific and technical disciplines, applied problems of digital image and signal
processing require assessing the degree of randomness of the analyzed image fragments from
the presence or absence of local groupings of point objects in them (groupings that make a
fragment differ significantly from the ambient background). Such problems can be both
purely theoretical [1, 2] and purely applied [3]. For example, the presence of local clumps
in processed aerospace images [4, 5] may indicate the presence of latent objects within the
analyzed fragment that require more detailed study. In computer processing of biomedical
images, one of the most important steps of the preprocessing stage is the search for abnormal
heterogeneities and thickenings, which may be evidence of various disease-causing
abnormalities that require priority attention [6, 7]. In correlation rhythmography, a known
method makes it possible to assess the prospects for restoring sinus rhythm from the set of
cardiogram intervals that form an autoregressive cloud (scatterogram) [8, 9].
   Mathematically similar problems arise when studying the registration of random point
fields by a scanning aperture with a limited number of threshold levels. When the random
coordinates of the small-sized (ideally, point) objects that form such a field are recorded, a
failure occurs at the moment when the number of signal point objects within the scanning
aperture exceeds the specified threshold level. It is shown in [10] that when the analyzed image
is formed by a random Poisson flow of constant intensity, the two-dimensional problem of

SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia
reznik@iae.nsk.su (A. L. Reznik); soloviev@iae.nsk.su (A. A. Soloviev); torgov@iae.nsk.su (A. V. Torgov)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


Aleksandr L. Reznik et al. CEUR Workshop Proceedings                                       246–251


estimating the probability that the registration process will proceed without failures reduces,
by means of standard factorization, to the following one-dimensional problem:
   “Find the probability 𝑃𝑛,𝑘 (𝜀) that, when 𝑛 points are randomly dropped on the interval (0, 1),
no subinterval Ω𝜀 ⊂ (0, 1) of length 𝜀 contains more than 𝑘 points”.
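The event in this problem statement can be checked directly for a concrete set of points. The following is a minimal sketch, not taken from the paper: the helper names `largest_clump` and `no_clump_exceeds` are hypothetical, and the window of length 𝜀 is treated as closed, which does not change probabilities for continuously distributed points.

```python
from typing import Sequence


def largest_clump(points: Sequence[float], eps: float) -> int:
    """Largest number of points covered by a single window of length eps."""
    xs = sorted(points)
    best = 0
    left = 0  # index of the leftmost point still inside the current window
    for right, x in enumerate(xs):
        # Shrink the window from the left until it is no longer than eps.
        while x - xs[left] > eps:
            left += 1
        best = max(best, right - left + 1)
    return best


def no_clump_exceeds(points: Sequence[float], eps: float, k: int) -> bool:
    """True if no window of length eps contains more than k points."""
    return largest_clump(points, eps) <= k
```

Averaging `no_clump_exceeds` over many random samples of 𝑛 uniform points would give an empirical estimate of the probability asked for in the problem.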
   When solving applied problems of detecting abnormally large clumps in the analyzed images
(as in the above-mentioned processing and analysis of aerospace or biomedical images), one
needs the probability formulas 𝑃𝑛,𝑘 (𝜀) in which the integer parameter 𝑘 is as close as possible
to 𝑛, that is, the exact analytical dependencies 𝑃𝑛,𝑛−1 (𝜀), 𝑃𝑛,𝑛−2 (𝜀), 𝑃𝑛,𝑛−3 (𝜀), etc. In
particular, 𝑃𝑛,𝑛−1 (𝜀) is the probability that 𝑛 points randomly thrown onto the interval (0, 1)
do not all “merge” into a single 𝜀-grouping; 𝑃𝑛,𝑛−2 (𝜀) is the probability that no 𝜀-grouping
of more than 𝑛 − 2 points is formed; 𝑃𝑛,𝑛−3 (𝜀) is the probability that no 𝜀-grouping of more
than 𝑛 − 3 points is formed, and so on.
   In a number of other applied problems, on the contrary, exact analytical relations are
required for the smallest values of the threshold parameter, i.e. 𝑘 = 1, 2, 3, etc. Such formulas
are needed when the research calls for the probability that 𝑛 random point objects are
distributed within the studied interval so that the size of every 𝜀-group does not exceed the
threshold level 𝑘 = 1, 2, 3, etc. The purpose of this work is to propose analytical methods and
software algorithms for finding exact probability formulas 𝑃𝑛,𝑘 (𝜀) for both the largest (that
is, as close as possible to 𝑛) and the smallest values of the threshold parameter 𝑘.


2. Obtaining particular solutions to a problem using computer
   analytics programs
The simplicity of the problem posed in the introduction is illusory, and its analytical solution is
known only for the simplest case 𝑘 = 1 [11, 12]:
\[
P_{n,1}(\varepsilon) = \bigl(1 - (n-1)\varepsilon\bigr)^{n}, \qquad 0 \le \varepsilon \le \frac{1}{n-1}. \tag{1}
\]
   Formula (1) gives the probability that, when 𝑛 points are randomly dropped onto the
interval (0, 1), no 𝜀-group containing two or more points is formed, i.e. that every pair of
dropped points is separated by a distance exceeding 𝜀. The classical way to obtain solution (1)
is to represent the desired probability as an easily evaluated iterated integral [11]:
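\[
P_{n,1}(\varepsilon) = n! \int_{(n-1)\varepsilon}^{1} dx_n
\left\{ \int_{(n-2)\varepsilon}^{x_n-\varepsilon} dx_{n-1} \cdots
\left\{ \int_{2\varepsilon}^{x_4-\varepsilon} dx_3
\left\{ \int_{\varepsilon}^{x_3-\varepsilon} dx_2
\left\{ \int_{0}^{x_2-\varepsilon} dx_1 \right\} \right\} \right\} \right\}.
\]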
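The iterated integral above can be evaluated exactly for a rational 𝜀 by integrating polynomials one variable at a time. The sketch below is an illustration only and is not the authors' software; the function names (`integrate`, `evaluate`, `shift`, `p_n1_iterated`, `p_n1_closed`) are hypothetical, and polynomials are stored as coefficient lists in increasing powers.

```python
from fractions import Fraction
from math import factorial


def integrate(p):
    """Antiderivative of polynomial p with zero constant term."""
    return [Fraction(0)] + [c / (i + 1) for i, c in enumerate(p)]


def evaluate(p, x):
    """Value of polynomial p at x (Horner scheme)."""
    result = Fraction(0)
    for c in reversed(p):
        result = result * x + c
    return result


def shift(p, eps):
    """Coefficients of p(x - eps), built by Horner's scheme on polynomials."""
    out = []
    for c in reversed(p):
        new = [Fraction(0)] * (len(out) + 1)
        for i, a in enumerate(out):  # new = out * (x - eps)
            new[i + 1] += a
            new[i] -= a * eps
        new[0] += c
        out = new
    return out


def p_n1_iterated(n, eps):
    """n! times the iterated integral, assuming 0 <= eps <= 1/(n-1)."""
    integrand = [Fraction(1)]
    for j in range(1, n):
        # Integrate over x_j from (j-1)*eps to x_{j+1} - eps.
        q = integrate(integrand)
        integrand = shift(q, eps)
        integrand[0] -= evaluate(q, (j - 1) * eps)
    # Outermost integral over x_n from (n-1)*eps to 1.
    q = integrate(integrand)
    return factorial(n) * (evaluate(q, Fraction(1)) - evaluate(q, (n - 1) * eps))


def p_n1_closed(n, eps):
    """Closed form (1)."""
    return (1 - (n - 1) * eps) ** n
```

For example, `p_n1_iterated(3, Fraction(1, 10))` and `p_n1_closed(3, Fraction(1, 10))` both evaluate to `Fraction(64, 125)`.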




   Solution (1) can also be obtained in a different way. For example, a simple
probabilistic-geometric method proposed in [10] allows one to calculate the probability (1)
without resorting to multidimensional integration. Thus, the main problem is not difficult to
solve when 𝑘 = 1, but for 𝑘 > 1 it becomes much more complicated. The results we obtained
for this case are given below.
   First, note that for arbitrary fixed values of 𝑛 and 𝑘, the desired solution can be represented
in the form of an 𝑛-fold integral:
\[
P_{n,k}(\varepsilon) = n! \int \cdots \int_{D_{n,k}(\varepsilon)} dx_1 \ldots dx_n, \tag{2}
\]

where the domain of integration 𝐷𝑛,𝑘 (𝜀) is given by the system of linear inequalities
\[
\begin{cases}
0 < x_1 < x_2 < \cdots < x_{n-1} < x_n < 1,\\
x_{k+1} - x_1 > \varepsilon,\\
x_{k+2} - x_2 > \varepsilon,\\
\quad\vdots\\
x_n - x_{n-k} > \varepsilon.
\end{cases}
\]
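The system of inequalities translates directly into a membership test for sorted points, which in turn gives a simple Monte Carlo estimate of 𝑃𝑛,𝑘 (𝜀). This is only an illustrative sketch and is unrelated to the exact analytical software described in the paper; the names `in_domain` and `estimate_p`, the default number of trials and the seed are assumptions made for this example.

```python
import random
from typing import Sequence


def in_domain(xs_sorted: Sequence[float], k: int, eps: float) -> bool:
    """Check x_{i+k} - x_i > eps for every admissible i (points already sorted)."""
    n = len(xs_sorted)
    return all(xs_sorted[i + k] - xs_sorted[i] > eps for i in range(n - k))


def estimate_p(n: int, k: int, eps: float, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of random n-point samples on (0, 1) that fall into the domain."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n))
        hits += in_domain(xs, k, eps)
    return hits / trials
```

With 𝑛 = 3, 𝑘 = 1 and 𝜀 = 0.1 the estimate should lie close to the value (1 − 2 · 0.1)³ = 0.512 given by formula (1), up to sampling error.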

   To calculate integral (2), we developed a method of successive dimensionality reduction
based on the step-by-step replacement of the initial 𝑛-fold integral with a set of structurally
similar iterated integrals whose dimension is lower by one. By formalizing this method and
applying cyclic recursion, we designed and implemented two software systems that compute
the desired probabilities analytically, as piecewise-polynomial functions of the continuous
parameter 𝜀. The first system calculates the limits of integration for each of the iterated
integrals into which the original 𝑛-fold integral (2) decomposes; the second is based on
repeated differentiation of integral (2) with respect to the parameter 𝜀. In addition to these
two systems, a third algorithmic scheme was developed and implemented in software; it uses
a discrete-combinatorial model to calculate the probability formulas 𝑃𝑛,𝑘 (𝜀).
   Analytical calculations performed with these software systems made it possible to find a
complete set of particular formulas 𝑃𝑛,𝑘 (𝜀) over all ranges of the continuous parameter 𝜀
for all values of the integer parameters 𝑛 and 𝑘 up to 𝑛 = 14. Their calculation involves a
large number of routine operations: setting the limits of integration, checking intermediate
systems of inequalities for consistency, and performing direct integration in 𝑛-dimensional
space, which is almost impossible to do “manually” even for 𝑛 = 4. Therefore, all the
necessary calculations were carried out using parallel computing algorithms on
high-performance computing clusters [13].

2.1. Algorithms for software and analytical calculation of probabilistic
     formulas 𝑃𝑛,𝑘 (𝜀)
At the next stage, by analyzing the particular results obtained, we tried to establish the
general laws governing the structure of the probability formulas for the




case 𝑘 > 1. Several such analytic patterns were indeed discovered and subsequently rigorously
proved. First, for 𝑘 = 𝑛 − 1, a simple dependence was traced and later proved:
\[
P_{n,n-1}(\varepsilon) = 1 - n\varepsilon^{n-1} + (n-1)\varepsilon^{n}. \tag{3}
\]
(It should be recalled that formula (3) describes the probability that if 𝑛 points are randomly
dropped onto the interval (0, 1), they will not all “merge” into one compact 𝜀-grouping.)
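One standard route to formula (3), given here as an illustration and not necessarily the authors' proof, uses the sample range of 𝑛 independent uniform points. The symbol R for the distance between the largest and the smallest point is introduced only for this sketch; all 𝑛 points fit into one 𝜀-grouping exactly when R does not exceed 𝜀.

```latex
% Density of the range R of n independent uniform points on (0, 1):
%   f_R(r) = n(n-1) r^{n-2} (1 - r),   0 < r < 1.
\begin{align*}
\Pr(R \le \varepsilon)
  &= \int_0^{\varepsilon} n(n-1)\, r^{n-2}(1-r)\, dr
   = n\varepsilon^{n-1} - (n-1)\varepsilon^{n}, \\
P_{n,n-1}(\varepsilon)
  &= 1 - \Pr(R \le \varepsilon)
   = 1 - n\varepsilon^{n-1} + (n-1)\varepsilon^{n}.
\end{align*}
```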
   For 𝑘 = 𝑛 − 2, the relationship 𝑃𝑛,𝑘 (𝜀) is more complex:
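\[
P_{n,n-2}(\varepsilon) =
\begin{cases}
1 - 2C_n^2\,\varepsilon^{n-2}(1-\varepsilon)^2 - 2\varepsilon^{n}, & 0 \le \varepsilon \le \tfrac{1}{2};\\[4pt]
1 - 2\varepsilon^{n} + (2\varepsilon-1)^{n} - 2C_n^2\,\varepsilon^{n-2}(1-\varepsilon)^2, & \tfrac{1}{2} \le \varepsilon \le 1.
\end{cases}
\tag{4}
\]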
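A quick consistency check of formula (4): for 𝑛 = 3 the threshold 𝑛 − 2 equals 1, so (4) should reproduce formula (1) on 0 ≤ 𝜀 ≤ 1/2 and vanish beyond it. The sketch below evaluates both with exact rational arithmetic; the function names are hypothetical and chosen for this example only.

```python
from fractions import Fraction
from math import comb


def p_n_nminus2(n: int, eps: Fraction) -> Fraction:
    """Formula (4): both branches share the same base expression."""
    base = 1 - 2 * comb(n, 2) * eps ** (n - 2) * (1 - eps) ** 2 - 2 * eps ** n
    if eps <= Fraction(1, 2):
        return base
    # The second branch adds the term (2*eps - 1)^n.
    return base + (2 * eps - 1) ** n


def p_n1(n: int, eps: Fraction) -> Fraction:
    """Formula (1), valid for 0 <= eps <= 1/(n-1)."""
    return (1 - (n - 1) * eps) ** n
```

For instance, `p_n_nminus2(3, Fraction(1, 4))` and `p_n1(3, Fraction(1, 4))` both give `Fraction(1, 8)`.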
  For 𝑘 = 𝑛 − 3, the dependence 𝑃𝑛,𝑘 (𝜀) becomes so complicated that its reconstruction by
analyzing particular software solutions is a completely independent and difficult task:
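\[
\underset{(n > 6)}{P_{n,n-3}(\varepsilon)} =
\begin{cases}
\begin{aligned}
&1 - 2\varepsilon^{n} + C_n^1\,(6\varepsilon^{n} - 4\varepsilon^{n-1}) + C_n^2\,(-3\varepsilon^{n} + \varepsilon^{n-2}) \\
&\quad + C_n^3\,(9\varepsilon^{n} - 18\varepsilon^{n-1} + 12\varepsilon^{n-2} - 3\varepsilon^{n-3}),
\end{aligned}
& 0 \le \varepsilon \le \tfrac{1}{2};\\[10pt]
\begin{aligned}
&1 - 2\varepsilon^{n} + (2\varepsilon-1)^{n} + C_n^1\,(1-\varepsilon)\bigl(-2\varepsilon^{n-1} + 2(2\varepsilon-1)^{n-1}\bigr) \\
&\quad + C_n^2\,(1-\varepsilon)^2\bigl(\varepsilon^{n-2} + (2\varepsilon-1)^{n-2}\bigr) - 3C_n^3\,\varepsilon^{n-3}(1-\varepsilon)^3,
\end{aligned}
& \tfrac{1}{2} \le \varepsilon \le 1.
\end{cases}
\tag{5}
\]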
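Formula (5) is long enough that a mechanical sanity check is useful. The sketch below encodes the two branches separately and lets one verify that they agree at 𝜀 = 1/2, that the first branch equals 1 at 𝜀 = 0 and that the second branch equals 0 at 𝜀 = 1. The function names are hypothetical; this checks only internal consistency and is not a proof of the formula.

```python
from fractions import Fraction
from math import comb


def p5_lower(n: int, eps: Fraction) -> Fraction:
    """First branch of formula (5), stated for 0 <= eps <= 1/2 and n > 6."""
    c1, c2, c3 = comb(n, 1), comb(n, 2), comb(n, 3)
    return (1 - 2 * eps**n
            + c1 * (6 * eps**n - 4 * eps**(n - 1))
            + c2 * (-3 * eps**n + eps**(n - 2))
            + c3 * (9 * eps**n - 18 * eps**(n - 1)
                    + 12 * eps**(n - 2) - 3 * eps**(n - 3)))


def p5_upper(n: int, eps: Fraction) -> Fraction:
    """Second branch of formula (5), stated for 1/2 <= eps <= 1 and n > 6."""
    c1, c2, c3 = comb(n, 1), comb(n, 2), comb(n, 3)
    d = 2 * eps - 1
    return (1 - 2 * eps**n + d**n
            + c1 * (1 - eps) * (-2 * eps**(n - 1) + 2 * d**(n - 1))
            + c2 * (1 - eps)**2 * (eps**(n - 2) + d**(n - 2))
            - 3 * c3 * eps**(n - 3) * (1 - eps)**3)
```

Calling both functions at `Fraction(1, 2)` for 𝑛 = 7, 8, 9, 10 returns identical values, so the two branches join continuously at that point.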

  Formulas (3)–(5) are confirmed both by software calculations and by direct analytical
integration.

2.2. Formulas 𝑃𝑛,𝑘 (𝜀) found using software, analytical and discrete-
     combinatorial algorithms
The purpose of developing discrete-combinatorial methods for calculating the formulas 𝑃𝑛,𝑘 (𝜀)
was to find a general solution for 𝑘 = 2 by analogy with formula (1), which is valid for 𝑘 = 1.
Unfortunately, this task turned out to be much more difficult than it seemed before the start
of the research. First, in contrast to the case 𝑘 = 1, the probability 𝑃𝑛,𝑘 (𝜀) consists of
several piecewise-homogeneous fragments joined continuously at the junction points. Secondly,
the formula itself changes depending on the parity of 𝑛. Thirdly, finding patterns in each range
of variation of the parameter 𝜀 requires devising an individual scheme for reducing the
continuous problem for that range to its own, very complex, discrete-probabilistic problem.
    In our proposed reduction scheme, generalized Catalan numbers appear in all subproblems
(i.e., in all ranges of variation of the parameter 𝜀). Their explicit form is needed when
ordering interdependent random number sequences. Most of these probabilistic-combinatorial
problems turned out to be more convenient to formulate and solve in a “word-linguistic” form.
In a number of cases, it was possible to use the technique of finding monotonic paths in Weyl
chambers [14].
    In the case 𝑘 ≥ 2 for the probabilities 𝑃𝑛,𝑘 (𝜀), we could not find a general compact analytical
relation similar to formula (1) for 𝑘 = 1. However, using all the above computer and discrete-
combinatorial tools, including software-analytical calculations and generalized Catalan numbers,




we have established and then proved a number of particular previously unknown analytical
dependencies. In particular, for 𝜀 → 0, an asymptotic formula, common for arbitrary 𝑛, was
established:
\[
\begin{aligned}
P_{n,2}(\varepsilon) ={}& C_n^0 + C_n^2(-n+2)\varepsilon^2 + C_n^3(4n-10)\varepsilon^3 + C_n^4(3n^2-37n+86)\varepsilon^4 \\
&+ C_n^5(-40n^2+394n-922)\varepsilon^5 + C_n^6(-15n^3+625n^2-5171n-12086)\varepsilon^6 \\
&+ C_n^7(420n^3-10724n^2+79996n-187002)\varepsilon^7 \\
&+ C_n^8(105n^4-10570n^3+205499n^2-1426841n+3336406)\varepsilon^8 \\
&+ C_n^9(5040n^4-155708n^3+2267664n^2-17317506n+52315558)\varepsilon^9 \\
&+ C_n^{10}(-945n^5+189000n^4-15794625n^3+389687181n^2-3798029823n \\
&\qquad\quad + 12998966646)\varepsilon^{10} + o(\varepsilon^{10}).
\end{aligned}
\tag{6}
\]
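The leading terms of (6) can be cross-checked against the closed forms given earlier. For 𝑛 = 3 and 𝑛 = 4 every binomial coefficient with upper index 𝑛 and lower index above 4 vanishes, so only the terms through 𝜀⁴ remain, and they should coincide with formula (3) for 𝑛 = 3 and with the first branch of formula (4) for 𝑛 = 4. The sketch below performs this comparison in exact arithmetic; the function names are hypothetical, and only the first four terms of (6) are encoded.

```python
from fractions import Fraction
from math import comb


def p_n2_through_eps4(n: int, eps: Fraction) -> Fraction:
    """Terms of expansion (6) up to and including eps^4."""
    return (1
            + comb(n, 2) * (-n + 2) * eps**2
            + comb(n, 3) * (4 * n - 10) * eps**3
            + comb(n, 4) * (3 * n**2 - 37 * n + 86) * eps**4)


def formula_3(n: int, eps: Fraction) -> Fraction:
    """P_{n,n-1}(eps) from formula (3)."""
    return 1 - n * eps**(n - 1) + (n - 1) * eps**n


def formula_4_first_branch(n: int, eps: Fraction) -> Fraction:
    """P_{n,n-2}(eps) from formula (4) for 0 <= eps <= 1/2."""
    return 1 - 2 * comb(n, 2) * eps**(n - 2) * (1 - eps)**2 - 2 * eps**n
```

For 𝑛 = 4 the truncated expansion reads 1 − 12𝜀² + 24𝜀³ − 14𝜀⁴, which is exactly the expanded first branch of (4).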

  For even values 𝑛 = 2𝑚 on the interval 1/𝑚 < 𝜀 < 1/(𝑚 − 1), the previously conjectured
formula has been rigorously proved:
\[
P_{2m,2}(\varepsilon) = \frac{1}{m+1}\, C_{2m}^{m}\, \bigl(1-(m-1)\varepsilon\bigr)^{2m}.
\]
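  For even values 𝑛 = 2𝑚 on the interval 1/(𝑚 + 1) < 𝜀 < 1/𝑚, the following formula has been
established:
\[
\begin{aligned}
P_{2m,2}(\varepsilon) ={}& C_{2m}^{m}\bigl(1-(m-1)\varepsilon\bigr)^{2m} - C_{2m}^{m-1}\bigl(1-(m-1)\varepsilon\bigr)^{2m} \\
&- C_{2m}^{m-2}(1-m\varepsilon)^{m+2}\bigl(1-(m-2)\varepsilon\bigr)^{m-2} \\
&+ 2C_{2m}^{m-3}(1-m\varepsilon)^{m+3}\bigl(1-(m-2)\varepsilon\bigr)^{m-3} \\
&- C_{2m}^{m-4}(1-m\varepsilon)^{m+4}\bigl(1-(m-2)\varepsilon\bigr)^{m-4}.
\end{aligned}
\]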

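  For odd values 𝑛 = 2𝑚 + 1 on the interval 1/(𝑚 + 1) < 𝜀 < 1/𝑚, the following formula has
been established:
\[
\begin{aligned}
P_{2m+1,2}(\varepsilon) ={}& C_{2m+1}^{m+1}(1-m\varepsilon)^{m+1}\bigl(1-(m-1)\varepsilon\bigr)^{m} \\
&- 2C_{2m+1}^{m+2}(1-m\varepsilon)^{m+2}\bigl(1-(m-1)\varepsilon\bigr)^{m-1} \\
&+ C_{2m+1}^{m+3}(1-m\varepsilon)^{m+3}\bigl(1-(m-1)\varepsilon\bigr)^{m-2}.
\end{aligned}
\]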
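The three range formulas for 𝑘 = 2 can be compared with the earlier closed forms in the smallest cases: for 𝑚 = 2 (𝑛 = 4) the even formulas should match formula (4), and for 𝑚 = 1 (𝑛 = 3) the odd formula should match formula (3). The sketch below does this in exact arithmetic. It assumes that binomial coefficients with a lower index outside 0..𝑛 are zero; that convention and all function names are choices made for this example.

```python
from fractions import Fraction
from math import comb


def binom(n: int, k: int) -> int:
    """Binomial coefficient, taken as 0 when k is outside 0..n (assumption)."""
    return comb(n, k) if 0 <= k <= n else 0


def term(coef: int, a: Fraction, pa: int, b: Fraction, pb: int) -> Fraction:
    """coef * a**pa * b**pb; vanishing coefficients are skipped entirely."""
    return coef * a**pa * b**pb if coef else Fraction(0)


def even_upper(m: int, eps: Fraction) -> Fraction:
    """n = 2m, 1/m < eps < 1/(m-1); requires m >= 2."""
    return Fraction(binom(2 * m, m), m + 1) * (1 - (m - 1) * eps) ** (2 * m)


def even_lower(m: int, eps: Fraction) -> Fraction:
    """n = 2m, 1/(m+1) < eps < 1/m."""
    n = 2 * m
    a, b, c = 1 - m * eps, 1 - (m - 2) * eps, 1 - (m - 1) * eps
    return ((binom(n, m) - binom(n, m - 1)) * c**n
            - term(binom(n, m - 2), a, m + 2, b, m - 2)
            + 2 * term(binom(n, m - 3), a, m + 3, b, m - 3)
            - term(binom(n, m - 4), a, m + 4, b, m - 4))


def odd_range(m: int, eps: Fraction) -> Fraction:
    """n = 2m + 1, 1/(m+1) < eps < 1/m."""
    n = 2 * m + 1
    a, b = 1 - m * eps, 1 - (m - 1) * eps
    return (term(binom(n, m + 1), a, m + 1, b, m)
            - 2 * term(binom(n, m + 2), a, m + 2, b, m - 1)
            + term(binom(n, m + 3), a, m + 3, b, m - 2))


def formula_3(n: int, eps: Fraction) -> Fraction:
    return 1 - n * eps**(n - 1) + (n - 1) * eps**n


def formula_4(n: int, eps: Fraction) -> Fraction:
    base = 1 - 2 * comb(n, 2) * eps**(n - 2) * (1 - eps)**2 - 2 * eps**n
    return base if eps <= Fraction(1, 2) else base + (2 * eps - 1)**n
```

For example, `even_upper(2, Fraction(3, 4))` equals `formula_4(4, Fraction(3, 4))`, and the two even-𝑛 formulas take the same value at the shared endpoint 𝜀 = 1/𝑚 because the difference of the first two binomial coefficients is the Catalan number that appears in the first formula.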

3. Conclusion
The results presented in this paper were obtained with the help of specially created tools of
machine analytics, as well as with generalized Catalan numbers, which made it possible to
transfer the inherently continuous problem of finding probability formulas to the category of
discrete-combinatorial ones. The efficiency of the proposed discrete-combinatorial methods
allows us to hope for further progress in solving the described “continuous” problem, up to
finding a general analytical formula for arbitrary values of the integer parameters 𝑛 and 𝑘
in all ranges of the continuous parameter 𝜀. Such a general analytical solution would provide
researchers with an additional tool for assessing whether the analyzed point image is random
or regular.




Acknowledgments
This work was supported in part by the Russian Foundation for Basic Research (project
No. 19-01-00128) and by the Ministry of Science and Higher Education of the Russian Federation
(project No. 121022000116-0).


References
 [1] Shannon C.E. A mathematical theory of communication // The Bell System Technical
     Journal. 1948. Vol. 27. Is. 3. P. 379–423.
 [2] Gnedenko B.V., Belyayev Y.K., Solovyev A.D. Mathematical methods of reliability theory.
     New York: Academic press, 1969. 518 p.
 [3] Birger I.A. Technical diagnostic. Moscow: Mashinostroenie, 1978. 240 p. (In Russ.)
 [4] Gromilin G.I., Kosykh V.P., Popov S.A., Streltsov V.A. Suppression of the background
     with drastic brightness jumps in a sequence of images of dynamic small-size objects //
     Optoelectronics, Instrumentation and Data Processing. 2019. Vol. 55. No. 3. P. 213–221.
 [5] Reznik A.L., Tuzikov A.V., Soloviev A.A., Torgov A.V., Kovalev V.A. Time-optimal
     algorithms focused on the search for random pulsed-point sources // Computer Optics. 2019.
     Vol. 43. No. 4. P. 605–610.
 [6] Ablameiko S.V., Anischenko V.V., Lapitsky V.A., Tuzikov A.V. Medical information
     technologies and systems. Minsk: OIPI NAS Belarus, 2007. 176 p.
 [7] Wójcik W., Pavlov S., Kalimoldayev M. Information technology in medical diagnostics.
     London: CRC Press, 2019. 336 p.
 [8] Poggio T., Girosi F. Networks for approximation and learning // Proceedings of the IEEE.
     1990. Vol. 78. P. 1481–1497.
 [9] Stinton P., Tinker I., Vickery I.C., Yahe S.P. The scatterogram. A new method for continuous
     electrocardiographic monitoring // Cardiovascular Research. 1972. Vol. 6. P. 598–604.
[10] Reznik A.L., Efimov V.M., Solov’ev A.A., Torgov A.V. Reliability of readout of random point
     fields with a limited number of threshold levels of the scanning aperture // Optoelectronics,
     Instrumentation and Data Processing. 2014. Vol. 50. No. 6. P. 582–588.
[11] Parzen E. Modern probability theory and its applications. New York; London: John Wiley &
     Sons Inc., 1960. 464 p.
[12] Wilks S. Mathematical statistics. Princeton: Princeton University Press, 1944. 284 p.
[13] Reznik A.L., Tuzikov A.V., Soloviev A.A., Torgov A.V. Intelligent software support for
     analysis of random digital images // Computational Technologies. 2018. Vol. 23. No. 5.
     P. 70–81.
[14] Gessel I.M., Zeilberger D. Random walk in a Weyl chamber // Proceedings of the AMS.
     1992. Vol. 115. No. 1. P. 27–31.

