Model of Estimation of Distribution Density for Large Statistical Datasets

Nataliya Boyko, Yaroslav Rohan and Svitlana Honchar
Lviv Polytechnic National University, Profesorska Street 1, Lviv, 79013, Ukraine

Abstract
The distribution density function is a fundamental concept in statistics: it provides a natural description of the distribution of any random variable and allows the corresponding probabilities to be determined. In this paper, we attempt to determine the 3D density distribution of galaxies in large surveys (such as the SDSS) in order to study the effect of the environment on galaxy evolution. We also explore the search for structures in large spaces, such as six-dimensional phase space, or even higher-dimensional spaces in large astronomical databases (such as the SDSS database itself). This is why we are interested in accurate and efficient density estimators for astronomical data sets in several dimensions.

Keywords
Distribution density, statistics, k-nearest neighbors, adaptive Gaussian kernel density estimation, Sloan Digital Sky Survey, Smoothed Particle Hydrodynamics.

MoMLeT+DS 2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November 25-26, 2022, Leiden-Lviv, The Netherlands-Ukraine
EMAIL: nataliya.i.boyko@lpnu.ua (N. Boyko); yaroslav.rohan.knm.2019@lpnu.ua (Ya. Rohan); svitlana.y.honchar@lpnu.ua (S. Honchar)
ORCID: 0000-0002-6962-9363 (N. Boyko); 0000-0002-9527-3145 (Ya. Rohan); 0000-0002-7420-962X (S. Honchar)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Estimating density in datasets is a critical first step in many areas of astronomy. For example, the environment of a galaxy obviously plays an important role in its evolution, as observed in the color-density and concentration-density relations; estimating the local density of galaxies is essential for studying these relations [1, 15, 20]. As another example, the reconstruction of the large-scale structure of the universe requires a proper estimate of the spatial density field. Even simulation requires density estimation: Smoothed Particle Hydrodynamics (SPH) is a method for simulating astronomical structure through astrophysical fluid-dynamics calculations, and it uses kernel density estimation to solve the hydrodynamic equations [3, 6, 8]. Density estimation is required not only for the analysis of structures in ordinary space, but also for structures in other spaces, such as the search for connected structures in six-dimensional phase space when modeling the formation of cosmic structure, or in three-dimensional projections of phase space in simulations of the growth of satellite galaxies [18, 22].
This paper examines the performance of four methods for estimating the distribution density function [5, 11, 21]:
• k-nearest neighbors (KNN);
• a 3D implementation of adaptive Gaussian kernel density estimation, called DEDICA;
• a modified version of Breiman's adaptive kernel density estimation, called the Modified Breiman Estimator (MBE);
• the Delaunay Tessellation Field Estimator (DTFE).
The first method is well known among astronomers; it determines the density by counting the number of neighbors near the point under consideration [2, 16]. This method is commonly used in studies of the relationship between the environment and the properties of galaxies. The second and third methods are both adaptive kernel density estimators, in which a kernel whose size
adapts to local conditions (the kernel is usually isotropic), according to criteria fixed beforehand or determined iteratively during the estimation, is used to smooth the point distribution so that the density can be estimated. The fourth method, like the first, uses the positions of neighbors to estimate local densities.

2. Overview of methods
2.1. Object of study
The purpose of density estimation is to approximate the true probability density function (pdf) of a random process from an observed data set. There are two main families of density estimators: parametric and nonparametric. In parametric methods, the type of distribution (uniform, normal, Poisson, etc.) must be known or guessed in advance, while nonparametric methods do not require this information. The methods considered in this study belong to the second type [4, 10, 19].
First, we must distinguish between different types of estimated density. Starting from the input data set, which consists of a list of positions of points r_i ∈ ℝ^d, i = 1, ..., N, in a d-dimensional spatial region, we define two types of probability density [22, 26]:
1. Point probability densities: the probability densities p̂(r_i) at the original positions of the points r_i;
2. Probability density field: the probability density p̂(r) at arbitrary points of the spatial region ℝ^d.
We often estimate the field density at the points of a Cartesian grid, so we also speak of a grid density. In addition, the probability density must be converted to a physical density when comparing galaxies, because the parameter of interest is a quantification of the environment of individual galaxies, not the probability of finding a galaxy at a particular position. The latter is what the density estimators compute, and it can be converted to the former by multiplying by N, namely [6, 12, 27]:
1. Point number density: n̂(r_i) = N·p̂(r_i);
2. Number density field: n̂(r) = N·p̂(r).

2.2. The k nearest neighbors method
The KNN estimator is well known in astronomy. Its principle is to center a spherical window on each point r and let it grow until it captures k samples (the k nearest neighbors of r). The KNN density estimate for a data set with N data points is then defined at any r ∈ ℝ^d as (Formula 1) [7, 13, 25]:

\hat{p}(r) = \frac{1}{N}\,\frac{k}{V_d\,\delta_k^{d}},    (1)

where δ_k is the distance from r to its k-th nearest neighbor and V_d is the volume of the unit sphere in d-dimensional space.
The KNN approach uses a different window size at each point, so it adapts to the local density: when the density near r is high, the window is small; when the local density is low, the window grows to a larger size. The KNN approach can therefore be a good solution for finding the "best" window size. However, the method suffers from a number of disadvantages. The obtained estimate is not a proper probability density, because its integral over all space diverges, and its tails fall off extremely slowly. The density field is very "spiky", and because of the heavy tails the calculated density is far from zero even in large regions where no samples are observed. In addition, it has discontinuities, even when the underlying distribution is continuous [14, 17, 23]. In astronomical works it is typical that the sampling point is not counted as its own neighbor. This is a conceptual problem, because the point density will then disagree with the field density at the location of the sample point.
In this paper, we count the sampling point as its own first neighbor, as in Silverman (1986), and use the average of the estimated KNN densities with k = 5 and k = 6 when calculating either the point density or the grid density. This is not exactly equivalent to the average of the k = 4 and k = 5 KNN densities used in many astronomical works (for example, Baldry et al. 2006): while the volume in the denominator of equation (1) would be equal, k in the numerator is one higher under Silverman's definition [9, 21, 27].
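To make equation (1) concrete, the following minimal Python sketch (not the code used in this study; function and variable names are illustrative) computes KNN point densities with SciPy's k-d tree, counting each sample point as its own first neighbor and averaging the k = 5 and k = 6 estimates as described above.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_point_density(points, k=5):
    """KNN point-density estimate, eq. (1): p(r_i) = k / (N * V_d * delta_k^d).

    Silverman's convention is used: each sample point counts as its own first
    neighbor, so delta_k is the distance to the k-th neighbor including itself.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit sphere in d dimensions
    tree = cKDTree(points)
    dist, _ = tree.query(points, k=k)           # dist[:, 0] == 0 (the point itself)
    delta_k = dist[:, k - 1]                    # distance to the k-th neighbor
    return k / (n * v_d * delta_k ** d)

# Average of the k = 5 and k = 6 estimates, as used for the point densities here.
rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 3))
density = 0.5 * (knn_point_density(sample, k=5) + knn_point_density(sample, k=6))
```

Using a k-d tree keeps the neighbor search at roughly O(N log N) for the whole sample.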
2.3. Adaptive kernel density estimation
Breiman (1977) described the case of an adaptive (Gaussian) kernel. The method begins by calculating the distance δ_{i,k} to the k-th nearest neighbor of each data point located at r_i, as in the KNN density estimator. Instead of using this distance to calculate a KNN density estimate, it is used to control the local kernel size (also known as the bandwidth) in an adaptive kernel density estimate, or Parzen estimate. For a sample D_N of N points with position vectors r_i ∈ ℝ^d (i = 1, ..., N) and a kernel K(r), the adaptive kernel density estimate p̂(r) is given by (Formula 2) [17, 24, 28]:

\hat{p}(r) = \frac{1}{N}\sum_{i=1}^{N} (\alpha_k \delta_{i,k})^{-d}\, K\!\left(\frac{r - r_i}{\alpha_k \delta_{i,k}}\right).    (2)

In his simulations, Breiman used a symmetric Gaussian kernel. Here k and α_k still need to be determined. If k or α_k is too small, the result will be noisy, whereas if k and α_k are large, details are lost. The appropriate values of the parameters σ (the width of the normal distribution), k and α_k were determined by optimizing certain goodness-of-fit criteria [11, 19].
Silverman (1986) argues that we can interpret this as the use of a "pilot estimate" of the density. We can see this by observing from equation (1) that (Formula 3):

\hat{p}_{kNN}(r_i) \propto \delta_{i,k}^{-d}.    (3)

Thus, the bandwidth at each location is proportional to p̂_{kNN}(r_i)^{-1/d}, so a pilot KNN density estimate is implicitly used to control the final density estimate. The effect is that in low-density regions δ_{i,k} is large and the kernel expands, while in high-density regions the opposite happens.

2.4. Fundamentals of the Modified Breiman Estimator (MBE)
Breiman's approach to finding the correct parameter values is computationally expensive, because the estimator has to be run many times to find the optimal parameters. This is made even more expensive by the fact that the kernel has infinite support: every data point contributes to the density at every position, so testing the parameters costs O(N²) [12, 22]. We plan to apply the method to astronomical datasets that are very large (> 50,000 data points) and high-dimensional (10 to hundreds of dimensions). For this reason, we use a fast and scalable modification of Breiman's method along the lines of Wilkinson and Meyer (1995).
Silverman (1986) noted that the implicit KNN pilot estimate can be replaced by another estimate without a significant change in quality. Therefore, Wilkinson and Meyer used the kernel density estimator itself to compute the pilot. In addition, they replaced the infinite support of the Gaussian kernel with the finite support of the Epanechnikov kernel, which significantly increases the computational speed and is optimal in the sense of the minimum mean integrated squared error. To further increase the speed of the pilot estimation, the pilot density field is first calculated at the points of a grid, after which the pilot density at each data point is obtained by multilinear interpolation.
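As a rough illustration of this grid-based speed-up (a sketch under assumptions, not the original implementation), the fragment below evaluates a fixed-width density estimate with the finite-support Epanechnikov kernel mentioned above on a regular grid and then carries it back to the data points by multilinear interpolation with SciPy. The grid resolution and the pilot width sigma are free parameters chosen here only for illustration.

```python
import numpy as np
from math import gamma
from scipy.spatial import cKDTree
from scipy.interpolate import RegularGridInterpolator

def pilot_density_via_grid(points, sigma, n_grid=32):
    """Fixed-width Epanechnikov density evaluated on a regular grid, then
    interpolated back to the data points (the pilot stage described above)."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)        # volume of the unit d-sphere
    # Regular grid spanning the data, padded by one kernel width on each side.
    lo, hi = points.min(axis=0) - sigma, points.max(axis=0) + sigma
    axes = [np.linspace(lo[j], hi[j], n_grid) for j in range(d)]
    nodes = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    # Finite kernel support: only points within sigma of a grid node contribute.
    tree = cKDTree(points)
    dens = np.zeros(len(nodes))
    for g, idx in enumerate(tree.query_ball_point(nodes, r=sigma)):
        if idx:
            t2 = np.sum((points[idx] - nodes[g]) ** 2, axis=1) / sigma ** 2
            dens[g] = np.sum((d + 2) / (2 * v_d) * (1.0 - t2)) / (n * sigma ** d)
    grid_density = dens.reshape((n_grid,) * d)
    # Multilinear interpolation of the gridded pilot back onto the data points.
    interp = RegularGridInterpolator(axes, grid_density)
    return interp(points), axes, grid_density
```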
The method is also scalable: even when the number of data points grows very large, the calculation time remains limited by the number of grid points [13, 25]. In the modified version, equation (2) becomes (Formula 4):

\hat{p}(r) = \frac{1}{N}\sum_{i=1}^{N} (\sigma\lambda_i)^{-d}\, K_e\!\left(\frac{r - r_i}{\sigma\lambda_i}\right),    (4)

where K_e is the Epanechnikov kernel, defined as (Formula 5):

K_e(t) = \begin{cases} \dfrac{d+2}{2V_d}\,(1 - t\cdot t) & \text{if } t\cdot t < 1, \\[4pt] 0 & \text{otherwise,} \end{cases}    (5)

where V_d is the volume of the unit sphere in d-dimensional space. The density estimation proceeds in two phases [11, 19].
Phase 1. Calculate the optimal pilot window width σ_opt from the spread of the data. Determine the pilot density using equation (4) with σ = σ_opt and λ_i = 1.
Phase 2. From the pilot density, calculate the local bandwidth parameters λ_i by (Formula 6):

\lambda_i = \left(\frac{\hat{p}_{\mathrm{pilot}}(r_i)}{g}\right)^{-\alpha}.    (6)

Here g is the geometric mean of the pilot densities and α = 1/d is the sensitivity parameter. The value 1/d is chosen to be equivalent to Breiman's method, although some authors prefer the value 1/2 regardless of d. The final density estimate is given by equation (4) again, but now with σ = σ_opt and λ_i as given by equation (6).
Compared to the original Breiman method, note that a fixed window width σ_opt is used for the pilot estimate, rather than a fixed value of k. During the second phase of the algorithm, the window width is varied with the density at each data point through the local bandwidth parameter: data points with a low pilot density get a large window and vice versa.
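A condensed sketch of the two-phase procedure is given below (illustrative Python, not the implementation used in this study). For brevity it evaluates the pilot directly at the data points rather than on a grid, and it takes the pilot width sigma (the paper's σ_opt) as a user-supplied argument, since the rule for deriving σ_opt from the data is not reproduced here; all names are assumptions.

```python
import numpy as np
from math import gamma
from scipy.spatial import cKDTree

def epanechnikov(t2, d, v_d):
    """Epanechnikov kernel of eq. (5), written in terms of the squared argument t.t."""
    return np.where(t2 < 1.0, (d + 2) / (2.0 * v_d) * (1.0 - t2), 0.0)

def adaptive_density(points, queries, sigma, lam):
    """Adaptive Epanechnikov estimate of eq. (4) with per-point bandwidths sigma * lam[i]."""
    points, queries = np.asarray(points, float), np.asarray(queries, float)
    n, d = points.shape
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)         # volume of the unit d-sphere
    h = sigma * lam                                   # bandwidth attached to each data point
    tree = cKDTree(queries)
    dens = np.zeros(len(queries))
    for i in range(n):
        # Point i only contributes to query points within its own bandwidth h[i].
        idx = tree.query_ball_point(points[i], r=h[i])
        if idx:
            t2 = np.sum((queries[idx] - points[i]) ** 2, axis=1) / h[i] ** 2
            dens[idx] += epanechnikov(t2, d, v_d) / h[i] ** d
    return dens / n

def modified_breiman(points, sigma, alpha=None):
    """Two-phase MBE sketch: fixed-width pilot, then local bandwidths from eq. (6)."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    alpha = 1.0 / d if alpha is None else alpha       # sensitivity parameter, 1/d as in the text
    # Phase 1: pilot density with lambda_i = 1 and window width sigma (sigma_opt).
    pilot = adaptive_density(points, points, sigma, np.ones(n))
    pilot = np.clip(pilot, np.finfo(float).tiny, None)
    # Phase 2: local bandwidth parameters, g = geometric mean of the pilot densities.
    g = np.exp(np.mean(np.log(pilot)))
    lam = (pilot / g) ** (-alpha)
    return adaptive_density(points, points, sigma, lam)
```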
2.5. Adaptive Gaussian kernel density estimation (DEDICA)
Pisani proposed a kernel-based density estimation method for multivariate data, as a continuation of his work on the univariate case. Again, this is an adaptive kernel estimator. The main differences from the MBE method are that a Gaussian kernel is used and that the optimal bandwidths are determined iteratively by minimizing a cross-validation estimate. This study uses the 3D density estimator DEDICA, which is Pisani's FORTRAN implementation [3, 8, 19]. For a sample D_N of N points with position vectors r_i ∈ ℝ^d (i = 1, ..., N) and kernel width σ_i for the i-th point, the adaptive Gaussian kernel density estimate is given by (Formula 7):

\hat{p}(r) = \frac{1}{N}\sum_{i=1}^{N} K_n(|r_i - r|, \sigma_i),    (7)

where K_n(t, σ) is the standard d-dimensional Gaussian kernel (Formula 8):

K_n(t, \sigma) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left[-\frac{t^2}{2\sigma^2}\right].    (8)

The kernel widths σ_i are chosen by an iterative method that minimizes the integrated squared error locally. The procedure is as follows:
1. The window width is initialized (Formula 9):

\sigma^{(0)} = 4\sigma_t, \qquad \sigma_t = A(K)\, N^{-\frac{1}{d+4}} \sqrt{\frac{1}{d}\sum_{l=1}^{d} s_l^2},    (9)

where s_l is the standard deviation of the l-th coordinate of the data, and A(K) = 0.96 for the Gaussian kernel.
2. Iteratively perform the following steps for n = 1, 2, ...:
(a) halve the window width: σ^(n) = σ^(n-1)/2;
(b) calculate the pilot estimate from (7) with fixed kernel sizes σ_i = σ^(n);
(c) calculate the local bandwidth coefficients from (6) with α = 1/2;
(d) calculate the adaptive kernel estimate from (7) with the adaptive kernel sizes;
(e) calculate the cross-validation estimate (Pisani 1996, eq. (7)) (Formula 10):

M\!\left(\hat{p}^{(n)}_{ka}\right) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K_n\!\left(|r_i - r_j|,\ \sqrt{(\sigma_i^{(n)})^2 + (\sigma_j^{(n)})^2}\right) - \frac{2}{N(N-1)}\sum_{i=1}^{N}\sum_{j\neq i} K_n\!\left(|r_i - r_j|,\ \sigma_j^{(n)}\right).    (10)

Minimizing the cross-validation estimate is equivalent to minimizing the integrated squared error between the true density and the estimated density [5, 14, 20].
3. Determine the iteration number n = n_opt for which the cross-validation estimate is minimized, and return the corresponding optimal window widths and the adaptive kernel density estimate at the sample points.
The cross-validation procedure can be understood by looking at the behavior of its different terms. As σ^(n) decreases during the iteration, some terms continue to increase, while others begin to decrease once the local window sizes become much smaller than the interpoint distances. This is the point at which the minimum is reached and the iteration stops [4, 29].
Although, as shown below, DEDICA gives good results in many cases, in some situations it fails. This can be attributed to some disadvantages of the method. First, the fixed kernel sizes σ^(n) used for the pilot estimates form a discrete series of values (determined by the choice of σ^(0)); this range of values may be too coarse to find the optimal window width. Second, the method seeks a globally optimal result, which may not be optimal in some regions [12, 19].
An extension of the DEDICA code was developed in this study to obtain the grid density, because the original code calculates only the point densities. The optimal window widths of each point, calculated during the estimation of the point densities, are used to obtain an adaptive kernel density estimate at each grid point r via (7) (Fig. 1).
Figure 1: Simulated data sets with known density distributions

2.6. Delaunay Tessellation Field Estimator (DTFE)
DTFE is a method, well known in astronomy, for reconstructing density fields from a discrete set of scattered points. In this method, the Delaunay tessellation of the points is constructed first. The density at a point is then defined as the inverse of the total volume V_i of the tetrahedra (in 3D) surrounding that point, multiplied by a normalization constant. For a sample D_N of N points with position vectors r_i ∈ ℝ^d (i = 1, ..., N), the DTFE density estimate is given by (Formula 11):

\hat{p}(r_i) = \frac{1}{N}\,\frac{d+1}{V_i},    (11)

where V_i = \sum_{j=1}^{K} V_{\mathrm{tetra},j}. Here V_{tetra,j} is the volume of the j-th tetrahedron and K is the number of tetrahedra containing the point r_i. In the next step, the density field is obtained by linear interpolation of the point densities at the vertices of the Delaunay tetrahedra over the sampled volume.
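Equation (11) maps directly onto SciPy's Delaunay triangulation. The following minimal sketch (illustrative names; not the implementation used in this study) computes the DTFE point densities; the interpolation of the field between the vertices is omitted.

```python
import numpy as np
from math import factorial
from scipy.spatial import Delaunay

def dtfe_point_density(points):
    """DTFE point densities, eq. (11): p(r_i) = (d + 1) / (N * V_i), where V_i is the
    total volume of the Delaunay simplices (tetrahedra in 3D) that contain point i."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    tri = Delaunay(points)
    # Volume of each simplex from the determinant of its edge matrix.
    verts = points[tri.simplices]                      # shape (n_simplices, d + 1, d)
    edges = verts[:, 1:, :] - verts[:, :1, :]          # shape (n_simplices, d, d)
    vol = np.abs(np.linalg.det(edges)) / factorial(d)
    # Sum, for every data point, the volumes of all simplices touching it.
    v_i = np.zeros(n)
    np.add.at(v_i, tri.simplices.ravel(), np.repeat(vol, d + 1))
    v_i = np.clip(v_i, np.finfo(float).tiny, None)     # guard against degenerate zero volumes
    return (d + 1) / (n * v_i)
```

For a full field estimate, the point densities would then be linearly interpolated inside each tetrahedron, as described above.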
2.7. Formulation of the problem
Galaxies are strongly influenced by their environment. Quantifying the density around a galaxy is a difficult but critical step in studying galaxy properties. The aim is therefore to identify the differences between density estimation methods and their application to astronomical problems. We study the effectiveness of four density estimation methods: k-nearest neighbors (KNN), adaptive Gaussian kernel density estimation (DEDICA), a special case of adaptive Epanechnikov kernel density estimation (MBE), and the Delaunay Tessellation Field Estimator (DTFE).

3. Analytical section
3.1. Dataset
We investigated the effectiveness of the four density estimation methods on three classes of datasets: a series of simulated datasets with known density fields, to test the ability of each method to recover relatively simple density distributions; an astronomical data set with an unknown but well-sampled density field based on the Millennium simulation; and two different observed galaxy samples taken from the SDSS.

3.2. Simulated data sets with known density fields
We start by constructing six simulated data sets with known density distributions (Fig. 1).
Data set 1 is a unimodal Gaussian distribution with added uniform noise. Data set 2 contains two Gaussian distributions with the same number of points but different covariance matrices and different centers, again with added uniform noise; this data set has the same number of points as data set 1. Data set 3 contains four Gaussian distributions with equal numbers of points but different covariance matrices and different centers, again with added uniform noise; this data set has twice as many points as data sets 1 and 2. Data set 4 contains a wall-like and a thread-like structure: the x and y coordinates of the wall-like structure are drawn from a uniform distribution and the z coordinate from a Gaussian distribution, while the thread-like structure is created with a Gaussian distribution in the x and y coordinates and a uniform distribution in the z coordinate. Data set 5 contains three wall-like structures, where each wall is created with a uniform distribution in two dimensions and a Gaussian distribution in the third. Data set 6 contains points drawn from a lognormal distribution. Scatter plots of these data sets are shown in Fig. 2.
Figure 2: Scatter plots of the simulated data sets. Left to right, top to bottom: data sets 1-6
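As an illustration of how such test data can be constructed, the sketch below generates a data set of the second kind: two 3D Gaussian components with different centers and covariance matrices plus a uniform noise background. The specific centers, covariances, point counts and noise fraction are arbitrary example values, not the ones used in this study.

```python
import numpy as np

rng = np.random.default_rng(42)

def two_gaussians_with_noise(n_per_component=2000, n_noise=500, box=(0.0, 10.0)):
    """Sketch of a data set like set 2: two 3D Gaussians with different centers
    and covariance matrices, plus uniformly distributed background noise."""
    mean1, mean2 = np.array([3.0, 3.0, 3.0]), np.array([7.0, 6.0, 5.0])
    cov1 = np.diag([0.5, 0.5, 0.5])
    cov2 = np.array([[1.0, 0.3, 0.0],
                     [0.3, 0.6, 0.1],
                     [0.0, 0.1, 0.4]])
    cloud1 = rng.multivariate_normal(mean1, cov1, size=n_per_component)
    cloud2 = rng.multivariate_normal(mean2, cov2, size=n_per_component)
    noise = rng.uniform(box[0], box[1], size=(n_noise, 3))
    return np.vstack([cloud1, cloud2, noise])

data_set_2 = two_gaussians_with_noise()
```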
3.3. Astronomical data sets with unknown density fields
Three astronomical data sets are used to test the performance of the methods on astronomical data: semi-analytical model galaxies produced by the Millennium simulation, and two samples of galaxies taken from the SDSS.

3.3.1. MSG data set
The first astronomical data set consists of the L-Galaxies semi-analytical model sample built on the "milliMil" run of the Millennium simulation. The Millennium simulation is one of the largest simulations of the evolution of the universe ever performed, following almost 2 × 10^10 particles. It was created to predict the large-scale structure of the universe and to compare it with observational data and astrophysical theories. The L-Galaxies are created by populating halo merger trees extracted from the Millennium simulation with semi-analytical models, following the prescriptions of De Lucia and Blaizot. A much smaller "milliMillennium" simulation ("milliMil") is used here, which follows only ~2 × 10^7 particles, together with the associated L-Galaxies data. This dataset is referred to as the MSG dataset and contains 53,918 points. Visually, the simulation output looks like a thin three-dimensional web of filaments with fractal self-similarity and several levels of organization.
Figure 3: Scatter plots of the MSG data sets. Top to bottom: MSG data, MSG-DTFE data set, MSG-MBE data set
The goal is to use the complexity of the MSG dataset to test the performance of the methods in a well-chosen but reasonably "astronomical" setting. Unfortunately, the true underlying density field of the MSG dataset is unknown. Therefore, new samples are drawn from the MSG data in order to obtain a "true density" for astronomical data: a density field estimated from the MSG data is used to create new datasets, and this field is taken to be the true density of those datasets. The process of creating the new datasets can be described as follows:
Step 1: Calculate the density field of the MSG data set using one of the density estimation methods. Then create a new data set that has a probability density function similar to that of the MSG data as follows:
1. Generate a random position r_i(x, y, z) within the volume of the original sample and a random value p between zero and the maximum density of the sampled field;
2. Interpolate the density P at the point r_i(x, y, z) in the field obtained in step 1;
3. If p ≤ P, accept the point and add it to the new data set; otherwise reject it.
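The acceptance step just described is ordinary rejection sampling against the estimated field. A minimal sketch is given below; it assumes the density field has already been evaluated on a regular grid (for instance with one of the estimators sketched earlier) and is interpolated with SciPy, and all names are illustrative.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def sample_from_density_field(axes, grid_density, n_new, rng=None, batch=1024):
    """Draw n_new points whose pdf follows a gridded density field by rejection
    sampling: propose a uniform position r, draw p uniformly between zero and the
    maximum of the field, and keep r only if p <= P, the interpolated density at r."""
    rng = np.random.default_rng() if rng is None else rng
    interp = RegularGridInterpolator(axes, grid_density, bounds_error=False, fill_value=0.0)
    lo = np.array([a[0] for a in axes])
    hi = np.array([a[-1] for a in axes])
    p_max = float(grid_density.max())
    accepted = []
    while len(accepted) < n_new:
        r = rng.uniform(lo, hi, size=(batch, len(axes)))   # step 1: random positions
        p = rng.uniform(0.0, p_max, size=batch)            # ...and random values p
        keep = p <= interp(r)                              # steps 2-3: accept if p <= P
        accepted.extend(r[keep])
    return np.asarray(accepted[:n_new])
```

Each new data set produced this way has, by construction, the estimated field as its known "true" density.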