On the statistics of anomalous clumps in random point images Aleksandr L. Reznik1 , Aleksandr A. Soloviev1 and Andrey V. Torgov1 1 Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia Abstract New algorithms for calculating exact analytical formulas describing two related probabilities are proposed, substantiated and software implemented: 1) the probability of the formation of anomalously large local groups in a random point image; 2) the probability of the absence of significant local groupings in a random point image. Keywords Random point image, computer analysis, local groupings. 1. Introduction In many scientific and technical disciplines, when solving applied problems related to digital image and signal processing, it becomes necessary to assess the degree of randomness of the analyzed image fragments, depending on the presence or absence of local groupings of point objects on them (as a result, the analyzed fragment significantly differs from ambient background). Such problems arise in many scientific and technical disciplines and can be both purely theoretical [1, 2] and purely applied [3]. For example, the presence of local clumps in processed aerospace images [4, 5] may indicate the presence of latent objects within the analyzed fragment that require more detailed study. In computer processing of biomedical images, one of the most important moments of the preliminary processing stage is the search for abnormal heterogeneities and thickenings, which may be evidence of various disease- causing abnormalities that require priority attention [6, 7]. In correlation rhythmography, a method is known that makes it possible to construct prognostic assessments of the possibility of restoration of sinus rhythm by a set of intervals on the cardiogram that form an autoregressive cloud (scatterogram) [8, 9]. Mathematically similar problems arise when studying the process of registering random point fields using a scanning aperture with a limited number of threshold levels. When fixing random coordinates of small-sized (ideally, point) objects that form such a field, a failure occurs at the moment when the number of signal point objects located within the scanning aperture exceeds the specified threshold level. It is shown in [10] that in cases where the analyzed image is formed by a random Poisson flux of constant intensity, the two-dimensional problem of SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia " reznik@iae.nsk.su (A. L. Reznik); soloviev@iae.nsk.su (A. A. Soloviev); torgov@iae.nsk.su (A. V. Torgov) Β© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 246 Aleksandr L. Reznik et al. CEUR Workshop Proceedings 246–251 estimating the probability that the registration process will be carried out without failures is reduced (this is achieved by means of standard factorization) to the following one-dimensional problem: β€œIt is required to find the probability of the event that if 𝑛 points are randomly dropped on the interval (0, 1), not a single grouping will be formed, located in a certain subinterval β„¦πœ€ βŠ‚ (0, 1), having a length πœ€ and including more than π‘˜ points”. When solving applied problems related to the detection of abnormally large clumps in the analyzed images (as, for example, in the above-mentioned problems related to the processing and analysis of aerospace or biomedical images), knowledge of probability formulas 𝑃𝑛,π‘˜ (πœ€) is required in which the value of the integer parameter π‘˜ is as close as possible to the value 𝑛. In such cases, it becomes necessary to know the exact analytical dependencies 𝑃𝑛,π‘›βˆ’1 (πœ€), 𝑃𝑛,π‘›βˆ’2 (πœ€), 𝑃𝑛,π‘›βˆ’3 (πœ€), etc. In particular, the probability 𝑃𝑛,π‘›βˆ’1 (πœ€) corresponds to the fact that if n points are randomly thrown over the interval (0, 1), all of them will not β€œmerge” into one πœ€-grouping, the probability 𝑃𝑛,π‘›βˆ’2 (πœ€) is that no πœ€-grouping with a size larger than 𝑛 βˆ’ 2 will be formed, the probability 𝑃𝑛,π‘›βˆ’3 (πœ€) β€” that no πœ€-groupings larger than 𝑛 βˆ’ 3 will be formed, etc. On the other hand, in a number of applied problems, on the contrary, it is required to know the exact analytical relations for the probabilities when the value of the threshold parameter π‘˜ is minimal, i.e. when π‘˜ = 1, 2, 3, etc. Such formulas are needed in cases when, by the nature of the research, it is required to estimate the probability that, within the studied interval, the distribution of 𝑛 random point markers-objects is such that the number of any πœ€-group does not exceed the threshold level π‘˜ = 1, 2, 3, etc. The purpose of this work is to propose analytical methods and software algorithms for finding exact probabilistic formulas 𝑃𝑛,π‘˜ (πœ€) for both the maximum (that is, as close as possible to 𝑛) and the minimum values of the threshold parameter π‘˜. 2. Obtaining particular solutions to a problem using computer analytics programs The simplicity of the problem posed in the introduction is illusory, and its analytical solution is known only for the simplest case π‘˜ = 1 [11, 12]: 1 𝑃𝑛,𝑙 (πœ€) = (1 βˆ’ (𝑛 βˆ’ 1)πœ€)𝑛 , 0β‰€πœ€β‰€ . (1) π‘›βˆ’1 Formula (1) describes the probability of an event that if 𝑛 points are randomly dropped onto the interval (0, 1), not a single πœ€-group will be formed containing at least 2 points, that all the ejected points will be located between themselves at a distance exceeding πœ€. The classical way to obtain solution (1) is to represent the desired probability in the form of an easily integrable iterated integral [11]: ⎧ ⎧ π‘₯ βˆ’πœ€ ⎧ π‘₯ βˆ’πœ€ ⎧ π‘₯ βˆ’πœ€ ⎫⎫⎫⎫ ∫︁1 ⎨ π‘₯βˆ«οΈπ‘› βˆ’πœ€ βŽͺ ⎨ ∫︁4 ⎨ ∫︁3 ⎨ ∫︁2 ⎬⎬⎬βŽͺ ⎬ 𝑃𝑛,1 (πœ€) = 𝑛! 𝑑π‘₯𝑛 𝑑π‘₯π‘›βˆ’1 . . . 𝑑π‘₯3 𝑑π‘₯2 𝑑π‘₯1 . βŽͺ ⎩ ⎩ ⎩ ⎩ ⎭⎭⎭βŽͺ ⎭ (π‘›βˆ’1)πœ€ (π‘›βˆ’2)πœ€ 2πœ€ πœ€ 0 247 Aleksandr L. Reznik et al. CEUR Workshop Proceedings 246–251 Solution (1) can be obtained in a different way. For example, in [10], a simple probabilistic- geometric method was proposed that allows one to calculate the probability (1) without resorting to the procedure of multidimensional integration. Thus, it is not difficult to find a solution to the main problem when π‘˜ = 1. But for π‘˜ > 1 the problem becomes much more complicated. Here, our efforts have led to the results that will be given below. First, note that for arbitrary fixed values of 𝑛 and π‘˜, the desired solution can be represented in the form of an 𝑛-fold integral: ∫︁ ∫︁ 𝑃𝑛,π‘˜ (πœ€) = 𝑛! Β· Β· Β· 𝑑π‘₯1 . . . 𝑑π‘₯𝑛 , (2) 𝐷𝑛,π‘˜ (πœ€) where the domain of integration 𝐷𝑛,π‘˜ (πœ€) is given by the system of linear inequalities ⎧ βŽͺ βŽͺ 0 < π‘₯1 < π‘₯2 < Β· Β· Β· < π‘₯π‘›βˆ’1 < π‘₯𝑛 < 1, ⎨ π‘₯π‘˜+1 βˆ’ π‘₯1 > πœ€, βŽͺ βŽͺ βŽͺ π‘₯π‘˜+2 βˆ’ π‘₯2 > πœ€, .. . βŽͺ βŽͺ βŽͺ βŽͺ βŽͺ π‘₯𝑛 βˆ’ π‘₯π‘›βˆ’π‘˜ > πœ€. ⎩ To calculate integral (2), we developed a method of successive dimensionality reduction based on the step-by-step replacement of the initial 𝑛-fold integral with a set of structurally similar, but having dimension one less than the iterated integrals. Further, formalizing this method and applying cyclic recursion, two systems for analytical calculation of probabilities were designed and implemented as a computer software in order to calculate the desired piecewise- polynomial dependence in the form of functions of the continuous parameter πœ€. One system calculates the limits of integration for each of the iterated integrals into which the original 𝑛-fold integral (2) decomposes; the second software system is based on multiple differentiation of the integral (2) with respect to the parameter πœ€. In addition to the two mentioned software systems, a third algorithmic scheme was also developed and implemented as software, using a discrete-combinatorial model to calculate probabilistic formulas 𝑃𝑛,π‘˜ (πœ€). Analytical calculations performed using the listed software systems made it possible to find a complete set of partial formulas 𝑃𝑛,π‘˜ (πœ€) in all ranges of variation of the continuous parameter πœ€ for all values of integer parameters 𝑛 and π‘˜ up to 𝑛 = 14. Note that their calculation is associated with a large amount of routine operations on setting the limits of integration, checking intermediate systems of inequalities for consistency, and performing direct integration in 𝑛-dimensional space, which is almost impossible to do β€œmanually” even for 𝑛 = 4. Therefore, all the necessary software calculations were carried out using the parallel computing algorithms the use of high-performance computing clusters [13]. 2.1. Algorithms for software and analytical calculation of probabilistic formulas 𝑃𝑛,π‘˜ (πœ€) At the next stage, we tried, using the analysis of the obtained partial results, to establish and, if possible, reveal the general laws governing the formation of probability formulas for the 248 Aleksandr L. Reznik et al. CEUR Workshop Proceedings 246–251 case π‘˜ > 1. And several of these analytic patterns were indeed discovered and subsequently rigorously proved. First, for π‘˜ = 𝑛 βˆ’ 1, a simple dependence was traced and later prove. First, for π‘˜ = 𝑛 βˆ’ 1, a simple dependence was traced and later proved 𝑃𝑛,π‘›βˆ’1 (πœ€) = 1 βˆ’ π‘›πœ€π‘›βˆ’1 + (𝑛 βˆ’ 1)πœ€π‘› . (3) (It should be recalled that formula (3) describes the probability that if 𝑛 points are randomly dropped onto the interval (0, 1), they will not all β€œmerge” into one compact πœ€-grouping.) For π‘˜ = 𝑛 βˆ’ 2, the relationship 𝑃𝑛,π‘˜ (πœ€) is more complex: ⎧ ⎨ 1 βˆ’ 2𝐢𝑛2 πœ€π‘›βˆ’2 (1 βˆ’ πœ€)2 βˆ’ 2πœ€π‘› , 1 0β‰€πœ€β‰€ ; βŽͺ 𝑃𝑛,π‘›βˆ’2 (πœ€) = 1 2 (4) ⎩ 1 βˆ’ 2πœ€π‘› + (2πœ€ βˆ’ 1)𝑛 βˆ’ 2𝐢𝑛2 πœ€π‘›βˆ’2 (1 βˆ’ πœ€)2 , βŽͺ ≀ πœ€ ≀ 1. 2 For π‘˜ = 𝑛 βˆ’ 3, the dependence 𝑃𝑛,π‘˜ (πœ€) becomes so complicated that its reconstruction by analyzing particular software solutions is a completely independent and difficult task: 1 βˆ’ 2πœ€π‘› + 𝐢𝑛1 (6πœ€π‘› βˆ’ 4πœ€π‘›βˆ’1 ) + 𝐢𝑛2 (βˆ’3πœ€π‘› + πœ€π‘›βˆ’2 )+ ⎧ 1 0β‰€πœ€β‰€ ; βŽͺ βŽͺ ⎨ +𝐢 3 (9πœ€π‘› βˆ’ 18πœ€π‘›βˆ’1 + 12πœ€π‘›βˆ’2 βˆ’ 3πœ€π‘›βˆ’3 ), βŽͺ 2 𝑛 𝑃𝑛, π‘›βˆ’3 (πœ€) = 𝑛 𝑛 1 π‘›βˆ’1 π‘›βˆ’1 (5) (𝑛 > 6) βŽͺ 1βˆ’2πœ€ + (2πœ€βˆ’1) + 𝐢𝑛 (1βˆ’πœ€)(βˆ’2πœ€ + 2(2πœ€βˆ’1) )+ 1 βŽͺ βŽͺ ≀ πœ€ ≀ 1. +𝐢𝑛2 (1βˆ’πœ€)2 (πœ€π‘›βˆ’2 + (2πœ€βˆ’1)π‘›βˆ’2 ) βˆ’ 3𝐢𝑛3 πœ€π‘›βˆ’3 (1βˆ’πœ€)3 , 2 ⎩ Formulas (3)–(5) are confirmed both by software calculations and by direct analytical inte- gration. 2.2. Formulas 𝑃𝑛,π‘˜ (πœ€) found using software, analytical and discrete- combinatorial algorithms The purpose of developing discrete-combinatorial methods for calculating formulas 𝑃𝑛,π‘˜ (πœ€) is that they can be used to try to find a general solution for π‘˜ = 2 by analogy with formula (1), which is valid for π‘˜ = 1. Unfortunately, this task turned out to be much more difficult than it seemed before the start of the research. This is primarily due to the fact that, in contrast to the case π‘˜ = 1, the probability 𝑃𝑛,π‘˜ (πœ€) consists of several piecewise-homogeneous fragments, continuously joined at the points of β€œconnection”. Secondly, the formula itself changes depending on the parity of 𝑛. Thirdly, finding patterns in each of the parameter πœ€ ranges variation requires the creation of an individual scheme for transferring each continuous problem corresponding to a given specific range to its own very complex discrete-probabilistic problem. In our proposed reduction scheme, generalized Catalan numbers appear in all subproblems (i.e., in all ranges of variation of the parameter πœ€). Knowing their explicit form is required when ordering interdependent random number sequences. Most of these probabilistic-combinatorial problems turned out to be more convenient to fomulate and solve in a β€œword-linguistic” form. In a number of cases, it was possible to use the technique of finding monotonic paths in Weyl chambers [14]. In the case π‘˜ β‰₯ 2 for the probabilities 𝑃𝑛,π‘˜ (πœ€), we could not find a general compact analytical relation similar to formula (1) for π‘˜ = 1. However, using all the above computer and discrete- combinatorial tools, including software-analytical calculations and generalized Catalan numbers, 249 Aleksandr L. Reznik et al. CEUR Workshop Proceedings 246–251 we have established and then proved a number of particular previously unknown analytical dependencies. In particular, for πœ€ β†’ 0, an asymptotic formula, common for arbitrary 𝑛, was established: 𝑃𝑛,2 (πœ€) = 𝐢𝑛0 + 𝐢𝑛2 (βˆ’π‘› + 2)πœ€2 + 𝐢𝑛3 (4𝑛 βˆ’ 10)πœ€3 + 𝐢𝑛4 (3𝑛2 βˆ’ 37𝑛 + 86)πœ€4 + + 𝐢𝑛5 (βˆ’40𝑛2 + 394𝑛 βˆ’ 922)πœ€5 + 𝐢𝑛6 (βˆ’15𝑛3 + 625𝑛2 βˆ’ 5171𝑛 βˆ’ 12086)πœ€6 + + 𝐢𝑛7 (420𝑛3 βˆ’ 10724𝑛2 + 79996𝑛 βˆ’ 187002)πœ€7 + + 𝐢𝑛8 (105𝑛4 βˆ’ 10570𝑛3 + 205499𝑛2 βˆ’ 1426841𝑛 + 3336406)πœ€8 + (6) + 𝐢𝑛9 (5040𝑛4 βˆ’ 155708𝑛3 + 2267664𝑛2 βˆ’ 17317506𝑛 + 52315558)πœ€9 + + 𝐢𝑛10 (βˆ’945𝑛5 + 189000𝑛4 βˆ’ 15794625𝑛3 + 389687181𝑛2 βˆ’ 3798029823𝑛+ 10 10 + 12998966646)πœ€ + π‘œ(πœ€ ). For even values of 𝑛 = 2π‘š on the segment 1/π‘š < πœ€ < 1/(π‘š βˆ’ 1), the previously stated hypothesis formula is rigorously proved 1 𝑃2π‘š,2 (πœ€) = 𝐢 π‘š (1 βˆ’ (π‘š βˆ’ 1)πœ€)2π‘š . π‘š + 1 2π‘š For even values of 𝑛 = 2π‘š on the segment 1/(π‘š + 1) < πœ€ < 1/π‘š, the formula is established π‘š π‘šβˆ’1 𝑃2π‘š,2 (πœ€) = 𝐢2π‘š (1 βˆ’ (π‘š βˆ’ 1)πœ€)2π‘š βˆ’ 𝐢2π‘š (1 βˆ’ (π‘š βˆ’ 1)πœ€)2π‘š βˆ’ π‘šβˆ’2 βˆ’ 𝐢2π‘š (1 βˆ’ π‘šπœ€)π‘š+2 (1 βˆ’ (π‘š βˆ’ 2)πœ€)π‘šβˆ’2 + π‘šβˆ’3 + 2𝐢2π‘š (1 βˆ’ π‘šπœ€)π‘š+3 (1 βˆ’ (π‘š βˆ’ 2)πœ€)π‘šβˆ’3 βˆ’ π‘šβˆ’4 βˆ’ 𝐢2π‘š (1 βˆ’ π‘šπœ€)π‘š+4 (1 βˆ’ (π‘š βˆ’ 2)πœ€)π‘šβˆ’4 . For odd values 𝑛 = 2π‘š + 1 on the segment 1/(π‘š + 1) < πœ€ < 1/π‘š, the formula is established π‘š+1 𝑃2π‘š+1,2 (πœ€) = 𝐢2π‘š+1 (1 βˆ’ π‘šπœ€)π‘š+1 (1 βˆ’ (π‘š βˆ’ 1)πœ€)π‘š βˆ’ π‘š+2 βˆ’ 2𝐢2π‘š+1 (1 βˆ’ π‘šπœ€)π‘š+2 (1 βˆ’ (π‘š βˆ’ 1)πœ€)π‘šβˆ’1 + π‘š+3 + 𝐢2π‘š+1 (1 βˆ’ π‘šπœ€)π‘š+3 (1 βˆ’ (π‘š βˆ’ 1)πœ€)π‘šβˆ’2 . 3. Conclusion The results presented in this paper were obtained with the help of specially created instruments of machine analytics, as well as with the use of generalized Catalan numbers, which made it possible to transfer the inherently continuous problem of finding probabilistic formulas to the category of discrete-combinatorial ones. The efficiency of the proposed discrete-combinatorial methods allows us to hope for further progress in solving the described β€œcontinuous” problem, up to finding a general analytical formula for arbitrary values of the integer parameters 𝑛 and π‘˜ in all variation ranges of the continuous parameter πœ€. The presence of such a generalized analytical solution would provide researchers with an additional tool for assessing whether the analyzed point image is random or regular. 250 Aleksandr L. Reznik et al. CEUR Workshop Proceedings 246–251 Acknowledgments This work was supported in part by the Russian Foundation for Basic Research (project No. 19- 01-00128), and Ministry of Science and Higher Education of the Russian Federation (project No. 121022000116-0). References [1] Shannon C.E. A mathematical theory of communication // The Bell System Technical Journal. 1948. Vol. 27. Is. 3. P. 379–423. [2] Gnedenko B.V., Belyayev Y.K., Solovyev A.D. Mathematical methods of reliability theory. New York: Academic press, 1969. 518 p. [3] Birger I.A. Technical diagnostic. Moscow: Mashinostroenie, 1978. 240 p. (In Russ.) [4] Gromilin G.I., Kosykh V.P., Popov S.A., Streltsov V.A. Suppression of the background with drastic brightness jumps in a sequence of images of dynamic small-size objects // Optoelectronics, Instrumentation and Data Processing. 2019. Vol. 55. No. 3. P. 213–221. [5] Reznik A.L., Tuzikov A.V., Soloviev A.A., Torgov A.V., Kovalev V.A. Time-optimal algo- rithms focused on the search for random pulsed-point sources // Computer Optics. 2019. Vol. 43. No. 4. P. 605–610. [6] Ablameiko S.V., Anischenko V.V., Lapitsky V.A., Tuzikov A.V. Medical information tech- nologies and systems. Minsk: OIPI NAS Belarus, 2007. 176 p. [7] WΓ³jcik W., Pavlov S., Kalimoldayev M. Information technology in medical diagnostics. London: CRC Press, 2019. 336 p. [8] Poggio T., Girosi F. Networks for approximation and learning // Proceedings of the IEEE. 1990. Vol. 78. P. 1481–1497. [9] Stinton P., Tinker I., Vickery I.C., Yahe S.P. The scatterogram. A new method for continuous electrocardiographic monitoring // Cardiovascular Research. 1972. Vol. 6. P. 598–604. [10] Reznik A.L., Efimov V.M., Solov’ev A.A., Torgov A.V. Reliability of readout of random point fields with a limited number of threshold levels of the scanning aperture // Optoelectronics, Instrumentation and Data Processing. 2014. Vol. 50. No. 6. P. 582–588. [11] Parzen E. Modern probability theory and its applications. New York; London: John Wiley & Sons Inc., 1960. 464 p. [12] Wilks S. Mathematical statistics. Princeton: Princeton University Press, 1944. 284 p. [13] Reznik A.L., Tuzikov A.V., Soloview A.A., Torgov A.V. Intelligent software support for analysis of random digital images // Computational Technologies. 2018. Vol. 23. No. 5. P. 70–81. [14] Gessel I.M., Zeilberger D. Random walk in a Weyl chamber // Proceedings of the AMS. 1992. Vol. 115. No. 1. P. 27–31. 251