Analysis Method for Determining the Suitability of Water for Human Consumption

Oleksiy Tverdokhlib 1, Denis Shavaev 1, Yurii Matseliukh 1, Aleksandr Gozhyj 2, Anna Maria Trzaskowska 3, Maksym Korobchynskyi 4, Lyubomyr Chyrun 5, and Irina Kalinina 2

1 Lviv Polytechnic National University, S. Bandera Street, 12, Lviv, 79013, Ukraine
2 Petro Mohyla Black Sea National University, Desantnykiv Street, 68, Mykolayiv, 54000, Ukraine
3 Gdansk University of Technology, G. Narutowicza Street 11/12, 80-233, Poland
4 Military Academy named after Eugene Bereznyak, 81 Y. Il'enka Str., Kyiv, 04050, Ukraine
5 Ivan Franko National University of Lviv, University Street, 1, Lviv, 79000, Ukraine

Abstract
The work establishes the main trends in determining the suitability of water for human consumption: the most common value of the acid-base balance (pH) of water is from 6 to 7; most samples in our data set are not suitable for drinking; the most common sulfate value is from 300 to 350; and the most common carbon values are within 12-15. The mean and most frequent value of the acid-alkaline balance of water is 7; the standard deviation of this parameter is small, the values vary in the range 0-14, and the acid-alkaline balance of water is quite stable. In this work, we constructed graphs in Cartesian and polar coordinate systems, derived quantitative characteristics of descriptive statistics, and built histograms and cumulates. To investigate this problem, we used the main methods of visualization, graphic representation and primary statistical processing of numerical data. Methods of correlation analysis of experimental data presented as time sequences were also used.

Keywords
Analysis method, determining, suitability, water, human consumption, cluster analysis, information technologies, intelligent analysis, system analysis, exponential smoothing, median filtering, data processing

1.
Introduction
The problem of determining the suitability of water for human consumption belongs to the goals of sustainable development and affects the development of human capital. The impact of the quality of life on the sustainable development of countries was studied in [1-4]. Authors [5-7] substantiated the role of the state in the preservation of natural resources, and scientists [8-10] studied the importance of existing environmental protection systems [11, 12]. Researchers [13-15] have also developed methods for assessing damage from environmental pollution and its impact on the quality of life of the population. The volume and quality of water bodies affect their consumption by humans [16-20].

Everyone uses water in one form or another. Water is present in food, in the air and, accordingly, in substances. Nothing can do without water. However it may sound, a person is made up of 70% water. Water ensures the body's normal functioning; therefore, any disruption of water intake in the diet leads to inevitable consequences and even fatalities. When there is a lot of water but it is of dubious quality, people usually do not drink it and begin to suffer from water deprivation; eventually they cannot stand it and drink this water, and as a result they develop serious diseases of the digestive system, which, in the absence of adequate medical care (for example, in Africa or in poor countries), lead to deaths [16-20].

MoMLeT+DS 2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November, 25-26, 2022, Leiden-Lviv, The Netherlands-Ukraine.
EMAIL: Oleksiy_Tverdokhlib@gmail.com (O. Tverdokhlib); Denis_Shavaev@gmail.com (D. Shavaev); indeed.post@gmail.com (Y. Matseliukh); alex.gozhyj@gmail.com (A. Gozhyj); atrzaskowska@gmail.com (A. M. Trzaskowska); maks_kor@ukr.net (M. Korobchynskyi); Lyubomyr.Chyrun@lnu.edu.ua (L. Chyrun); irina.kalinina1612@gmail.com (I. Kalinina)
ORCID: 0000-0002-7211-7370 (O. Tverdokhlib); 0000-0002-1707-1723 (D. Shavaev); 0000-0002-1721-7703 (Y. Matseliukh); 0000-0002-3517-580X (A. Gozhyj); 0000-0002-0911-945X (A. M. Trzaskowska); 0000-0001-8049-4730 (M. Korobchynskyi); 0000-0002-9448-1751 (L. Chyrun); 0000-0001-8359-2045 (I. Kalinina)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Therefore, it is very important to determine correctly whether water from a given place is suitable for consumption. Our inputs are pH, Water Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, Turbidity, and Potable. From the point of view of analysis, if some indicators exceed the norm by too much, even treatment will not help. If the indicators are within the acceptable range, it makes sense to attract investments, social projects, etc. [21-30]. These indicators determine the reagents required for water purification, which, in turn, affects the amount required to construct treatment facilities [31-42]. And if everything is normal, why not inform the residents that the water is suitable for drinking, or that boiling it is enough to avoid infection? Perhaps the water is even suitable for bottling once the turbidity is removed.

Lack of water is a tragedy, especially with climate warming. When water evaporates, the vapour is clean, so the share of pollutants in the remaining water increases; add to this unscrupulous residents upstream who throw away everything they can get their hands on, and we get a large-scale collapse. Therefore, this topic is more relevant than ever in a period of widespread environmental pollution.

2. Related works
The most common approaches to detecting and classifying water quality are described in [16-53].
Works [16-20] applied a hybrid CNN-LSTM model to predict two water quality variables, dissolved oxygen and chlorophyll. The results showed that the hybrid CNN-LSTM model outperformed the separate CNN and LSTM models. Modern machine learning technology and different AI methods are also used to develop similar systems as a component of a smart city [54-65]. Authors [16-42] compared statistical methods, including Fuzzy Logic Inference (FLI) and WQI, for water quality assessment in the community of Ikare, Nigeria [35]. They identified moderate and poor water quality using the FLI and WQI methods, respectively. They also found that the FLI method was superior to the WQI method because of the relationship between the measured and standard WQI values [43-53]. To estimate dissolved oxygen in aquaculture, authors [43] proposed a synthetic model. Although the CNN-LSTM and sparse-autoencoder-LSTM models showed excellent performance, they predicted only DO and chlorophyll, so it can be difficult to handle more water quality variables with such models. In another study [44], the authors applied Extra Tree Regression (ETR), which combines multiple weak learners, to predict WQI values in the Tsuen River, Hong Kong. They applied the ETR method to ten water quality variables. The results showed that the ETR method achieved 98% prediction accuracy, outperforming other state-of-the-art models such as support vector regression and decision trees. A comprehensive review of methods for river water quality modelling was conducted by authors [45], who reviewed 51 articles published between 2000 and 2016. According to this review, artificial neural networks and wavelet neural networks were the most widely used methods for water quality prediction. In addition, scientists [46] developed an artificial neural network. For this study, the most significant water quality parameters were found using spatial discriminant analysis (SDA).
However, according to another study [16], such models can barely reach an accuracy of 71%. The work [37] applied an artificial neural network to predict WQI in the Akaki River in Ethiopia. In this analysis, an artificial neural network with eight hidden layers and 15 hidden neurons predicted WQI with more than 90% accuracy. Also, authors [47] applied an artificial neural network with one hidden layer to predict the sustainability of water quality in São Paulo, Brazil. Applying neural networks for WQI prediction requires a large amount of water quality data, which is expensive and time-consuming to collect. Researchers [41] applied a decision tree to classify water quality status in the Klang River, Malaysia. They considered three scenarios; in the first scenario, they used six water quality variables. They then removed water quality parameters such as NH3-N, pH, and SS during each procedure to evaluate the ability of the decision tree algorithm in different situations. They achieved classification accuracies of 84.09%, 81.82%, and 77.27% in the three scenarios, which are higher than the 75% classification accuracy used for comparison [39-41]. This study used 22 water quality samples, making the model computationally expensive.

3. Methods
To solve the problems considered in this paper, we use several standard methods:

1. The moving average method [66, 67]. This method estimates the average level for a certain period. The longer the time interval over which the average is taken, the smoother the resulting series, but the less accurately the trend of the original time series is described [68-70]. The moving average method is the simplest way of smoothing empirical curves. The essence of this method is to replace the actual values of the indicator with their averaged values, which vary much less than the original levels of the series.
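As a minimal sketch of the replacement of actual values by averaged ones, the simple centred moving average can be written in base R. The series `y`, the helper name `simple_ma` and the use of `stats::filter()` (instead of a dedicated package) are assumptions made for illustration only; they are not taken from the dataset or from the code used later in the paper.

```r
# Simple centred moving average with an odd window length l (base R only).
# `y` is a short made-up series, not data from the water dataset.
simple_ma <- function(y, l) {
  stopifnot(l %% 2 == 1, l < length(y))
  # equal weights 1/l; sides = 2 centres the window on each level
  as.numeric(stats::filter(y, rep(1 / l, l), sides = 2))
}

y <- c(5, 7, 6, 9, 8, 12, 10, 13, 11)
sm3 <- simple_ma(y, 3)   # loses one level at each end
sm5 <- simple_ma(y, 5)   # loses two levels at each end

print(sm3)
print(sm5)
```

The wider window leaves more undefined levels at the ends of the series and gives a flatter curve, which is the trade-off between smoothness and accuracy mentioned above.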
Moving averages are distinguished by whether the averaging period contains an odd or an even number of time intervals [71, 72]. A more complex calculation scheme is used when the moving average is determined over an even number of elements. The following algorithm is used for the calculation. First, determine the length of the smoothing interval l, which includes l consecutive levels of the series (l < n) [73-75]. It should be taken into account that the wider the smoothing interval, the more the fluctuations cancel each other out and the smoother the resulting trend. The stronger the fluctuations, the wider the smoothing interval should be. Next, divide the entire observation period into sections, with the smoothing interval sliding along the series with a step equal to 1. Calculate the arithmetic mean of the levels of the series forming each section. Replace the actual value of the series at the centre of each section with the corresponding mean value. These steps constitute the algorithm for calculating a simple moving average [76-79]. Defining the moving average for an even number of levels in the smoothing interval is complicated by the fact that the average must then be attributed to the midpoint between the two central moments of the smoothing interval, and no observation was made at that moment. If the graph of the time series resembles a straight line, the moving average does not distort the dynamics of the studied phenomenon.

2. Weighted moving average method [66-70, 73, 80]. A more subtle technique, based on the same idea as simple moving averages, is to use weighted moving averages.
Whereas a simple moving average treats all levels of the series as equal, a weighted average assigns each level within the smoothing interval a weight that depends on its distance from the middle of the smoothing interval. When building a weighted moving average, on each section the value of the central level is replaced by the value calculated with the weighted arithmetic mean formula. In other words, a weighted moving average differs from a simple moving average in that the levels included in the averaging interval are summed with different weights [66-70, 73, 81-82]. This is because, for a simple moving average, the calculation within the smoothing interval is based on a straight line, that is, a first-order polynomial, whereas smoothing with a weighted moving average uses polynomials of higher order, preferably second or third. Therefore, the simple moving average method can be considered a special case of the weighted moving average method [66-72].

Calculating a moving average appears to be a simple and safe operation with a completely clear meaning. However, this operation transforms the time series more than it seems at first glance. If the levels of the series were independent before smoothing, then after this transformation the successive calculated levels (within the smoothing interval) depend on each other to some extent, because each level of the smoothed series shares terms with several previous and subsequent levels.
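The difference between equal and unequal weights can be illustrated with a short sketch. It assumes the classical five-point weights (-3, 12, 17, 12, -3)/35 obtained from a local second-order polynomial fit; the paper does not state which weights were used, so this choice, the helper name `weighted_ma` and the made-up series are assumptions for illustration only.

```r
# Weighted centred moving average: symmetric weights that sum to 1,
# applied over a window of odd length w = 2k + 1 (base R only).
weighted_ma <- function(y, weights) {
  stopifnot(length(weights) %% 2 == 1,
            isTRUE(all.equal(sum(weights), 1)))
  as.numeric(stats::filter(y, weights, sides = 2))
}

# Assumed weights: classical 5-point smoothing by a second-order polynomial
w5 <- c(-3, 12, 17, 12, -3) / 35

y  <- c(1, 4, 9, 16, 25, 36, 49)       # made-up, exactly quadratic series
wm <- weighted_ma(y, w5)               # weighted moving average
sm <- weighted_ma(y, rep(1 / 5, 5))    # equal weights = simple moving average

print(wm)
print(sm)
```

On this quadratic series the weighted average reproduces the central levels exactly, while the equal-weight average overestimates them, which shows why the simple moving average behaves as the first-order special case.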
The algorithm of smoothing with a weighted moving average uses a "window", the smoothing interval of size w = 2k + 1, which is successively shifted along the levels of the series and averages the levels it covers. The formula for the calculation is given in [66-72, 83-85].

3. Correlation field [67, 86-88]. A correlation field is a graph that establishes a relationship between variables, in which, for each unit of observation, X is the abscissa value and Y is the ordinate value. The number of points on the graph corresponds to the number of observation units. The placement of the points shows the presence and direction of the relationship. To build a correlation field, the following steps are usually taken. Choose two variables that change over time. Measure the value of the dependent variable and enter the results in a table. Build a coordinate grid, with the independent variable on the X axis and the dependent variable on the Y axis. Then mark the points of the correlation field: for each value of the independent variable on the X axis, mark the point whose Y coordinate is the corresponding value of the dependent variable. The result is called the correlation field. Finally, analyze the graph and draw a conclusion [67, 86-89].
a. Correlation coefficient.
b. Correlation ratio.
c. Correlation matrix.
d. Autocorrelation.

4. Cluster analysis is one of the methods of multivariate statistical analysis; that is, each observation is represented not by a single indicator but by a set of values of various indicators [5, 86, 91-99]. It includes algorithms that form clusters and distribute objects among them.
Cluster analysis, first of all, adds structure to the data and identifies groups of objects, that is, it divides the population into areas where objects accumulate. Cluster analysis makes it possible to consider fairly large volumes of data and to shorten and compress them sharply, making them compact.

4. Experiments
4.1. Analysis of existing software products
To begin with, we downloaded the dataset [89] and familiarized ourselves with it. Fig. 1 shows the original dataset in Excel (its columns are pH, Water hardness, Solids, Chloramines, Sulfate, Conductivity, Organic carbon, Trihalomethanes, Turbidity, and Potable):

Figure 1. A selected dataset in Excel

pH is an important parameter for assessing the acid-alkaline balance of water.
Water hardness is mainly due to calcium and magnesium salts [90]. These salts dissolve from geological deposits through which water moves.
Solids: a wide range of inorganic and organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates, etc., can dissolve in water. This is an important parameter for water use.
Chloramines are the main disinfectants used in public water systems.
Sulfates are naturally occurring substances found in minerals, soil and rocks. They are present in atmospheric air, underground water, plants and food products.
Conductivity: pure water is not a good conductor of electricity but a good insulator [90]. An increase in ion concentration increases the electrical conductivity of water.
Total organic carbon in source waters comes from decaying natural organic matter and synthetic sources.
Trihalomethanes are chemicals found in chlorinated water.
The turbidity of water depends on the amount of suspended solids.
Potable indicates whether the water is safe for human consumption, where 1 means potable and 0 means non-potable [90].
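Before plotting or computing statistics on a table with the columns listed above, it can be useful to check it for missing values, because functions such as mean() and sd() return NA when any value is missing. The sketch below is only an illustration under stated assumptions: `water_demo` is a small simulated data frame, its numeric values are arbitrary, and the column name `Potability` is hypothetical; none of it is read from the actual dataset.

```r
# Simulated stand-in for the dataset (arbitrary values, hypothetical names)
set.seed(1)
n <- 200
water_demo <- data.frame(
  ph         = runif(n, 0, 14),
  Hardness   = rnorm(n, 196, 33),
  Sulfate    = rnorm(n, 333, 41),
  Potability = rbinom(n, 1, 0.4)
)

# Inject missing values so that the check has something to find
water_demo$ph[sample(n, 20)]      <- NA
water_demo$Sulfate[sample(n, 30)] <- NA

# Structure of the table and number of missing values per column
str(water_demo)
na_per_column <- colSums(is.na(water_demo))
print(na_per_column)

# Keep only complete records and look at the class balance
water_complete <- na.omit(water_demo)
print(nrow(water_complete))
print(table(water_complete$Potability))
```

Whether to drop incomplete records or to impute them is a design choice that affects every statistic computed afterwards, so it is worth recording the number of rows before and after this step.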
Next, we loaded our dataset into the RStudio development environment:

water <- read.csv(file = 'D:/water_potability.csv')

Fig. 2. Dataset view in RStudio

We now present the dataset graphically.

Fig. 3. Graph of dependence of ph level on Solids in water in Cartesian coordinates

For visualization, we use the ggplot2 library, which allows you to build attractive graphs. First, install the library:

install.packages("ggplot2")

Program code for plotting the dependence of the degree of acidity on solids in the usual Cartesian coordinates:

library(ggplot2)
# Solids is mapped to x and ph to y, so the axis labels follow the same order
plot1 <- ggplot() + geom_line(aes(y = ph, x = Solids), data = water)
plot1 + labs(title = "Water quality", x = "Solids", y = "ph")

The polar coordinate system is most often used for pie charts, which are stacked bar charts drawn in polar coordinates. To plot the degree of acidity versus organic carbon, we used coord_polar():

plot2 <- ggplot(water, aes(x = ph, fill = Organic_carbon)) +
  geom_histogram(binwidth = 15, boundary = -7.5) +
  coord_polar() +
  scale_x_continuous(limits = c(0, 360))
plot2 + labs(title = "Water quality", x = "ph", y = "Organic_carbon")

Figure 4 shows the dependence of ph on Organic_carbon in polar coordinates.

Fig. 4. Graph of dependence of ph level on Organic_carbon in water in polar coordinates

Fig. 5. Graph of dependence of ph and sulfate in Cartesian coordinates

Water acidity versus sulfate content:

water_sorted_ph <- water[order(water$ph), ]
# Sulfate is mapped to x and ph to y, so the axis labels follow the same order
plot1 <- ggplot() + geom_line(aes(y = ph, x = Sulfate), data = water_sorted_ph)
plot1 + labs(title = "Water quality", x = "Sulfate", y = "ph")

A histogram is a way of graphically presenting tabular data and their distribution. A histogram can be created using the hist() function in the R programming language. This function accepts a vector of values for which the histogram is constructed.
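As a small sketch of what hist() computes, the function can be called with plot = FALSE to obtain the interval boundaries and counts without drawing anything. The vector `ph_demo` is simulated for illustration and is an assumption; it is not the ph column of the dataset.

```r
# Simulated values standing in for a pH column (assumption, not real data)
set.seed(42)
ph_demo <- rnorm(500, mean = 7, sd = 1.5)
ph_demo <- ph_demo[ph_demo >= 0 & ph_demo <= 14]

# Unit-width intervals over the pH scale; plot = FALSE returns the counts only
h <- hist(ph_demo, breaks = seq(0, 14, by = 1), plot = FALSE)

# Table of intervals and the number of cases in each
print(data.frame(from = head(h$breaks, -1), to = h$breaks[-1], count = h$counts))

# Interval with the largest number of cases
modal_bin <- which.max(h$counts)
cat("Most populated interval:", h$breaks[modal_bin], "-", h$breaks[modal_bin + 1], "\n")
```

Reading the counts directly in this way makes statements such as "about 500 cases in the interval 6-7" checkable without estimating bar heights by eye.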
The graph in Fig. 3 shows the dependence of ph (acidity) on solids. Most of the data lie between 15000 and 30000 for solids and between 5 and 10 for ph. It can be concluded that most of the water in this dataset is not of the best quality, and in some places it is very toxic.

Fig. 6. ph indicator

Program code for constructing a histogram of water acidity:

# hist() belongs to base R graphics, so no additional library is required
hist(water$ph, main = "Ph histogram", xlab = "Ph", col = "blue")

Similarly, the program code for building a histogram of water hardness:

hist(water$Hardness, main = "Hardness histogram", xlab = "Hardness", col = "blue")

It can be concluded that most of the water is not suitable for consumption because the indicators are too high. The ph histogram shows that the largest number of cases is in the interval 6-8, with about 500 cases in the interval 6-7. From Fig. 7, it can be seen that 1200 cases (60% of the entire sample) are unsuitable for use, and 800 are suitable.

Fig. 7. Potability indicator histogram

PerformanceAnalytics is a package of econometric functions for analyzing the performance and risks of financial instruments or portfolios. Let us determine some parameters of the pH indicator.

The arithmetic mean is the average value of the sample. We use the mean() function:

library(PerformanceAnalytics)
# arithmetic mean
seredne <- mean(water$ph)

The median is the number that divides the ordered sample in half. We use the median() function:

# median
median <- median(water$ph)

Fig. 8. Research results

Obtained results:
• Mean - 7.08599
• Standard error - 0.035
• Median - 7.027297
• Mode - 8.316766
• Standard deviation - 1.573337
• Sample variance - 2.474157
• Excess kurtosis - 0.6185764
• Skewness (asymmetry) - 0.04891027
• Range - 13.7725
• Minimum - 0.23
• Maximum - 14
• Sum - 14249.93
• Sample size - 2011
• Coefficient of variation - 22.2%

A more detailed analysis of the data can be found in the Discussion section.
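Several of the reported values are tied to each other by simple formulas, so they can be cross-checked using only the numbers listed above. The sketch below does this arithmetic; the variable names are chosen for illustration, and the interpretation of the variance as the form with n in the denominator is an inference from the numbers rather than a statement made in the text.

```r
# Values copied from the list of obtained results
n <- 2011          # sample size
m <- 7.08599       # mean
s <- 1.573337      # standard deviation

se      <- s / sqrt(n)           # standard error of the mean
cv      <- s / m * 100           # coefficient of variation, in percent
total   <- m * n                 # sum of all values
var_n1  <- s^2                   # variance with n - 1 in the denominator
var_n   <- s^2 * (n - 1) / n     # variance with n in the denominator

print(round(se, 3))       # compare with the reported 0.035
print(round(cv, 1))       # compare with the reported 22.2
print(round(total, 2))    # compare with the reported 14249.93
print(c(var_n1, var_n))   # the reported 2.474157 agrees with the n-denominator form
```

The reported variance matches the form that divides by the number of records, whereas sd() in R uses n - 1, which explains why the variance is slightly smaller than the square of the reported standard deviation.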
The standard error estimates how far the sample mean is likely to deviate from the actual mean. We define the std() function:

# standard error of the mean
std <- function(x) sd(x) / sqrt(length(x))
standartna_pomylka <- std(water$ph)

The mode is the value that occurs most often in the sample. We use the function getmode():

# mode
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode <- getmode(water$ph)

The standard deviation is the amount of spread relative to the arithmetic mean. To find it, we use the sd() function:

# standard deviation
standartne_vidchylennya <- sd(water$ph)

The variance is an estimate of the theoretical variance of the distribution based on the sample:

# variance (the sum of squared deviations is divided by the number of rows)
D <- 0
for (ph in water$ph) {
  D <- D + (ph - mean(water$ph))**2
}
Dyspersia <- (D / nrow(water))

Kurtosis is a parameter that reflects the peakedness of the distribution. We will use the moments library and the kurtosis() function:

# kurtosis
excess <- kurtosis(water$ph)

Skewness (asymmetry) reflects the skew of the distribution relative to the mode. We use the skewness() function:

# skewness
asumetrychnist <- skewness(water$ph)

The range (interval) is the difference between the maximum and minimum values of the sample:

# range
interval <- (max(water$ph) - min(water$ph))

Minimum - the smallest value of the sample:

# minimum
minimum <- min(water$ph)

Maximum - the largest value of the sample:

# maximum
maxsymum <- max(water$ph)

Sum of all sample values:

# sum
suma <- sum(water$ph)

Total number of rows with data:

# sample size
Nradkiw <- nrow(water)

The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage:

# coefficient of variation
coef_variacii <- (sd(water$ph) / mean(water$ph) * 100)

Cumulates are a representation of the distribution in the form of a curve, the ordinates of which are proportional to the accumulated frequencies of the variation series.
To build a series of accumulated frequencies, add the frequency of the second class to the frequency of the first (smallest) class, then add the frequency of the third class, and so on. Cumulates sometimes have an advantage over the variation curve.

ph = water$ph
breaks = seq(0, 14, by = 0.1)
ph.cut = cut(ph, breaks, right = FALSE)
ph.freq = table(ph.cut)
cumfreq0 = c(0, cumsum(ph.freq))
plot(breaks, cumfreq0, main = "ph", xlab = "ph", ylab = "Number")
lines(breaks, cumfreq0)

Hardness = water$Hardness
breaks = seq(74, 315, by = 1)
Hardness.cut = cut(Hardness, breaks, right = FALSE)
Hardness.freq = table(Hardness.cut)
cumfreq0 = c(0, cumsum(Hardness.freq))
plot(breaks, cumfreq0, main = "Hardness", xlab = "Hardness", ylab = "Number")
lines(breaks, cumfreq0)

A cumulate is a continuous curve drawn in a coordinate system in which the values of the characteristic, or the limits of its intervals, are shown on the abscissa axis and the increasing sum of frequencies on the ordinate axis.

Fig. 9. Cumulative indicators of ph

Fig. 10. Hardness indicator

Having analyzed the constructed cumulates, we can conclude that all indicators show a sharp increase in pollution, which correlates with the increase in the numerical value of the parameters, that is, there is more water with higher indicator values.

5. Results and discussion
Smoothing methods reduce the influence of the random component (random fluctuations) in time series [66-90]. They make it possible to obtain "purer" values, which consist only of deterministic components. Some methods aim to highlight only certain components, for example, a trend. We will perform smoothing using different methods. We will use the following libraries:
• library(tidyverse);
• library(lubridate);
• library(fpp2);
• library(zoo);
• library(pastecs);
• library(TTR).

We import and number the data:

water <- read.csv(file = 'D:/water_potability1.csv')
id <- c(1:3276)
water <- cbind(id, water)

1.
The moving average method [67]. We will use Kendall's formulas for smoothing with the moving average. The method is often used for statistical evaluation in hypothesis testing to determine whether two variables can be considered statistically dependent. Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The exact distribution cannot be characterized in terms of common distributions but can be calculated for small samples; for larger samples, it is common to use a normal approximation with a mathematical expectation of zero and the corresponding variance. We smooth our data with the smoothing interval sizes w = 3, 5, 7, 9, 11, 13, 15 to obtain seven columns using the rollmean() function:

ma <- water %>%
  select(id, Hardness) %>%
  mutate(ma1 = rollmean(Hardness, k = 3, fill = NA),
         ma2 = rollmean(Hardness, k = 5, fill = NA),
         ma3 = rollmean(Hardness, k = 7, fill = NA),
         ma4 = rollmean(Hardness, k = 9, fill = NA),
         ma5 = rollmean(Hardness, k = 11, fill = NA),
         ma6 = rollmean(Hardness, k = 13, fill = NA),
         ma7 = rollmean(Hardness, k = 15, fill = NA))

Next, we visualize the data:

ma %>%
  gather(metric, Hardness, Hardness:ma7) %>%
  ggplot(aes(id, Hardness, color = metric)) + geom_line()

# smoothed series without padding (each is shorter than the original)
ma1 = rollmean(water$Hardness, k = 3)
ma2 = rollmean(water$Hardness, k = 5)
ma3 = rollmean(water$Hardness, k = 7)
ma4 = rollmean(water$Hardness, k = 9)
ma5 = rollmean(water$Hardness, k = 11)
ma6 = rollmean(water$Hardness, k = 13)
ma7 = rollmean(water$Hardness, k = 15)

Search for turning points:

tp1 <- turnpoints(ma1); summary(tp1)
tp2 <- turnpoints(ma2); summary(tp2)
tp3 <- turnpoints(ma3); summary(tp3)
tp4 <- turnpoints(ma4); summary(tp4)
tp5 <- turnpoints(ma5); summary(tp5)
tp6 <- turnpoints(ma6); summary(tp6)
tp7 <- turnpoints(ma7); summary(tp7)

Visualization of turning points for the seventh smoothed series:

plot(tp7)
plot(ma7, type = "l")
lines(tp7)

We compute the correlation coefficients of the smoothed values with the original ones, taking into account that each smoothing drops values at both ends of the series:

cor(water$Hardness[2:3275], ma1)
cor(water$Hardness[3:3274], ma2)
cor(water$Hardness[4:3273], ma3)
cor(water$Hardness[5:3272], ma4)
cor(water$Hardness[6:3271], ma5)
cor(water$Hardness[7:3270], ma6)
cor(water$Hardness[8:3269], ma7)

We smooth the data using the smoothing interval w = 3, then smooth the obtained data again using the smoothing interval w = 5. We continue smoothing the resulting data with the smoothing interval w = 7 and so on until w = 15. This gives seven successive columns:

maRecursive <- water %>%
  select(id, Hardness) %>%
  mutate(ma1 = rollmean(Hardness, k = 3, fill = NA),
         ma2 = rollmean(ma1, k = 5, fill = NA),
         ma3 = rollmean(ma2, k = 7, fill = NA),
         ma4 = rollmean(ma3, k = 9, fill = NA),
         ma5 = rollmean(ma4, k = 11, fill = NA),
         ma6 = rollmean(ma5, k = 13, fill = NA),
         ma7 = rollmean(ma6, k = 15, fill = NA))

We smooth the data using the smoothing interval sizes w = 3, 5, 7, 9, 11, 13, 15 to obtain seven columns. To build the moving average, we took the hardness and id of each water record as parameters. You can see seven columns based on the given intervals.

Fig. 11.
Smoothed data according to Kendall's formulas

Visualization of smoothing:

maRecursive %>%
  gather(metric, Hardness, Hardness:ma7) %>%
  ggplot(aes(id, Hardness, color = metric)) + geom_line()

# drop the NA padding before searching for turning points
maR1 = maRecursive$ma1[!is.na(maRecursive$ma1)]
maR2 = maRecursive$ma2[!is.na(maRecursive$ma2)]
maR3 = maRecursive$ma3[!is.na(maRecursive$ma3)]
maR4 = maRecursive$ma4[!is.na(maRecursive$ma4)]
maR5 = maRecursive$ma5[!is.na(maRecursive$ma5)]
maR6 = maRecursive$ma6[!is.na(maRecursive$ma6)]
maR7 = maRecursive$ma7[!is.na(maRecursive$ma7)]

Search for turning points:

tpR1 <- turnpoints(maR1); summary(tpR1)
tpR2 <- turnpoints(maR2); summary(tpR2)
tpR3 <- turnpoints(maR3); summary(tpR3)
tpR4 <- turnpoints(maR4); summary(tpR4)
tpR5 <- turnpoints(maR5); summary(tpR5)
tpR6 <- turnpoints(maR6); summary(tpR6)
tpR7 <- turnpoints(maR7); summary(tpR7)

Visualization of turning points:

plot(tpR7)
plot(maR7, type = "l")
lines(tpR7)

We compute the correlation coefficients of the smoothed values with the original ones, taking into account that each successive smoothing drops further values at both ends of the series:

cor(water$Hardness[2:3275], maR1)
cor(water$Hardness[4:3273], maR2)
cor(water$Hardness[7:3270], maR3)
cor(water$Hardness[11:3266], maR4)
cor(water$Hardness[16:3261], maR5)
cor(water$Hardness[22:3255], maR6)
cor(water$Hardness[29:3248], maR7)

Fig. 12. Graphic representation of smoothed data

From this graph, you can see the fluctuations of the hardness parameter over the entire interval. The main series here are Hardness and ma7; we see that there is a certain trend. It is hard to see from the graph, but the end result is a smoother description of the data.

Fig. 13. Visualization of turning points

The turning points are quite numerous and detailed. As the smoothing interval increases, the correlation coefficient decreases because the data are increasingly modified.

Fig. 14.
Correlation coefficients between smoothed and original data

We smooth the data using the smoothing interval w = 3; then we smooth the obtained data again using the smoothing interval w = 5. We continue smoothing the resulting data with the smoothing interval w = 7 and so on until w = 15.

Fig. 15. Smoothed data according to Kendall's formulas

It can be seen that we lost more rows and got less accurate data.

Fig. 16. Graphic representation of smoothed data

Fig. 17. Turning points at the smoothing interval w = 15

Fig. 18. Visualization of turning points

Fig. 19. Correlation coefficients between smoothed and original data

The correlation coefficients also differ, but not by much, so the relationship with the raw data remains approximately the same.

2. Median smoothing [67]. The median smoothing algorithm computes the median of the levels within the smoothing interval and then replaces the level of the time series at the middle of the smoothing interval with this median value. Median smoothing completely removes single extreme or anomalous levels that are separated from each other by at least half of the smoothing interval; preserves sharp changes in the trend (which the moving average and exponential smoothing smooth out); and effectively removes single levels with very large or very small values that are random and stand out sharply from the other levels. We smooth the data using the smoothing interval sizes w = 3, 5, 7, 9, 11, 13, 15 to obtain seven columns using the runmed() function:

ms <- water %>%
  select(id, Hardness) %>%
  mutate(ms1 = runmed(Hardness, 3),
         ms2 = runmed(Hardness, 5),
         ms3 = runmed(Hardness, 7),
         ms4 = runmed(Hardness, 9),
         ms5 = runmed(Hardness, 11),
         ms6 = runmed(Hardness, 13),
         ms7 = runmed(Hardness, 15))

Fig. 20. Median smoothed data

We used the same smoothing intervals and operations as in the previous point.

Fig. 21.
Graphic representation of smoothed data

Visualization of smoothing:

    ms %>%
      gather(metric, Hardness, Hardness:ms7) %>%
      ggplot(aes(id, Hardness, color = metric)) + geom_line()

Search for turning points:

    tp1 <- turnpoints(ms$ms1)
    summary(tp1)
    tp2 <- turnpoints(ms$ms2)
    summary(tp2)
    tp3 <- turnpoints(ms$ms3)
    summary(tp3)
    tp4 <- turnpoints(ms$ms4)
    summary(tp4)
    tp5 <- turnpoints(ms$ms5)
    summary(tp5)
    tp6 <- turnpoints(ms$ms6)
    summary(tp6)
    tp7 <- turnpoints(ms$ms7)
    summary(tp7)

Visualization of turning points:

    plot(tp7)
    plot(ms$ms7, type = "l")
    lines(tp7)

Now let's find the turning points for the last smoothing with the interval w = 15:

Fig. 21. Turning points at the smoothing interval w = 15
Fig. 22. Visualization of turning points

Correlation coefficients of smoothed values with original ones:

    cor(water$Hardness, ms$ms1)
    cor(water$Hardness, ms$ms2)
    cor(water$Hardness, ms$ms3)
    cor(water$Hardness, ms$ms4)
    cor(water$Hardness, ms$ms5)
    cor(water$Hardness, ms$ms6)
    cor(water$Hardness, ms$ms7)

We smooth the data with the smoothing interval w = 3, then we smooth the obtained smoothed data again with w = 5. We continue smoothing the result with w = 7 and so on until w = 15.
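The effect of a single pass and of repeated passes can be checked on a small example before it is applied to the Hardness column. The sketch below is illustrative only: the toy vector x and the names m3, m3_5 and ma3 are assumptions introduced here, not part of the dataset or of the original analysis. It uses base R's runmed() and, for comparison, a centred moving average built with stats::filter().

```r
# Toy series with one isolated spike (hypothetical data, not from the water dataset)
x <- c(10, 11, 10, 12, 95, 11, 10, 12, 11, 10, 11)

# Single pass of running medians with window w = 3
m3 <- runmed(x, k = 3)

# Repeated smoothing: feed the w = 3 result into a w = 5 pass
m3_5 <- runmed(m3, k = 5)

# Centred moving average of width 3, for comparison (NA at both ends)
ma3 <- stats::filter(x, rep(1 / 3, 3), sides = 2)

print(rbind(x = x, m3 = m3, m3_5 = m3_5, ma3 = round(as.numeric(ma3), 1)))
```

In this toy series the isolated spike of 95 disappears after the first median pass, while the moving average spreads it over the neighbouring points and leaves NA values at both ends. Because runmed() returns a vector of the same length as its input, the median-smoothed columns can be correlated with the full Hardness column without trimming rows.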
For the Hardness column, the repeated smoothing should give seven columns in a row:

    msR <- water %>%
      select(id, Hardness) %>%
      mutate(ms1 = runmed(Hardness, 3),
             ms2 = runmed(ms1, 5),
             ms3 = runmed(ms2, 7),
             ms4 = runmed(ms3, 9),
             ms5 = runmed(ms4, 11),
             ms6 = runmed(ms5, 13),
             ms7 = runmed(ms6, 15))

Visualization of smoothing:

    msR %>%
      gather(metric, Hardness, Hardness:ms7) %>%
      ggplot(aes(id, Hardness, color = metric)) + geom_line()

Search for turning points:

    tp1 <- turnpoints(msR$ms1)
    summary(tp1)
    tp2 <- turnpoints(msR$ms2)
    summary(tp2)
    tp3 <- turnpoints(msR$ms3)
    summary(tp3)
    tp4 <- turnpoints(msR$ms4)
    summary(tp4)
    tp5 <- turnpoints(msR$ms5)
    summary(tp5)
    tp6 <- turnpoints(msR$ms6)
    summary(tp6)
    tp7 <- turnpoints(msR$ms7)
    summary(tp7)

Visualization of turning points:

    plot(tp7)
    plot(msR$ms7, type = "l")
    lines(tp7)

Correlation coefficients of smoothed values with original ones:

    cor(water$Hardness, msR$ms1)
    cor(water$Hardness, msR$ms2)
    cor(water$Hardness, msR$ms3)
    cor(water$Hardness, msR$ms4)
    cor(water$Hardness, msR$ms5)
    cor(water$Hardness, msR$ms6)
    cor(water$Hardness, msR$ms7)

The graph looks like this because the data have acquired a completed (fully smoothed) form.

Fig. 23. Correlation coefficients between smoothed and original data

The correlation coefficient is smaller than for the previous methods, which suggests that this method is not well suited to the given dataset because it reduces the reliability of the smoothed series.

Correlation analysis [66-80] is a group of methods that detect the presence and degree of a relationship between several randomly varying parameters. Special numerical characteristics and their statistics assess the degree of such a relationship. The correlation appears as a tendency of the average values of the function to change as the argument changes. The ggpubr library is a library for data visualization in R.
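As a minimal sketch of how such a degree of relationship is quantified, the Pearson coefficient can be computed directly from its definition and compared with the built-in cor() and cor.test() functions. The data below are synthetic, and the variable names only mimic the dataset columns; they are assumptions made for illustration and say nothing about the real relationship between ph and Solids.

```r
# Hypothetical synthetic data; names echo the dataset, values do not come from it
set.seed(1)
n <- 200
ph <- rnorm(n, mean = 7, sd = 1.5)
solids <- 22000 - 500 * ph + rnorm(n, sd = 8000)

# Pearson coefficient from its definition:
# sum of co-deviations over the square root of the product of squared deviations
r_manual <- sum((ph - mean(ph)) * (solids - mean(solids))) /
  sqrt(sum((ph - mean(ph))^2) * sum((solids - mean(solids))^2))

# The same quantity from the built-in function
r_builtin <- cor(ph, solids, method = "pearson")

# Significance test and confidence interval for the coefficient
ct <- cor.test(ph, solids, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin, p_value = ct$p.value))
```

The manual and built-in values agree up to floating-point error, and cor.test() adds a p-value and a confidence interval, which help to judge whether a weak coefficient is distinguishable from zero.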
We build a correlation field:

    library(ggpubr)
    plot(water$ph, water$Solids, main = "Correlation field", xlab = "ph", ylab = "Solids")

From the graphically presented field, it can be concluded that the indicators correlate quite strongly [55].

Fig. 24. Correlation field

We determine the correlation coefficient:

    correlation <- cor(water$ph, water$Solids)

Using the ggscatter method of the ggpubr library, we calculate the correlation relation:

    qwe <- ggscatter(water, x = "ph", y = "Solids", add = "reg.line", conf.int = TRUE,
                     cor.coef = TRUE, cor.method = "pearson", xlab = "ph", ylab = "Solids")

We divide the data into 3 parts:

    ph1 <- water$ph[1:1092]
    ph2 <- water$ph[1093:2184]
    ph3 <- water$ph[2185:3276]

For the parts, we build a correlation matrix (rcorr):

    library(Hmisc)  # provides rcorr()
    mydata.rcorr <- rcorr(as.matrix(cbind(ph1, ph2, ph3)))

We find multiple correlation coefficients:

    library(PerformanceAnalytics)  # provides chart.Correlation()
    numericData <- cbind(water$id, water$ph, water$Hardness, water$Solids, water$Chloramines,
                         water$Sulfate, water$Conductivity, water$Organic_carbon,
                         water$Trihalomethanes, water$Turbidity)
    chart.Correlation(numericData, histogram = TRUE, pch = 19)

Let's plot graphs of autocorrelation functions using acf:

    data <- cbind(water$ph, water$Solids)
    colnames(data) <- c("ph", "Solids")
    autocorrelation <- acf(data, lag.max = 1, type = c("correlation"), plot = TRUE, xlab = "ph", ylab = "Solids")

Fig. 25. Correlation matrix

The matrix displays all the coefficients and also shows the relationships graphically. Multiple correlation coefficients show that the dataset has weak but present relationships, on the basis of which results can be constructed.

Cluster analysis is one of the methods of multivariate statistical analysis; that is, each observation is represented not by a single indicator but by a set of values of various indicators [5, 86, 91-99]. It includes algorithms that form the clusters and distribute the objects among them.
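A minimal sketch of these two steps, cluster formation and the assignment of objects to clusters, is given below for synthetic data. The data frame toy and all values in it are hypothetical; only the column names echo the dataset. Standardising the columns first is a choice made in this sketch, since both methods rely on Euclidean distances and a column measured in thousands would otherwise dominate a column measured in units.

```r
# Hypothetical two-group data; names echo the dataset, values are synthetic
set.seed(55)
toy <- data.frame(
  ph     = c(rnorm(50, mean = 6, sd = 0.3), rnorm(50, mean = 8, sd = 0.3)),
  Solids = c(rnorm(50, mean = 15000, sd = 2000), rnorm(50, mean = 30000, sd = 2000))
)

# Standardise each column to zero mean and unit standard deviation
toy_scaled <- scale(toy)

# Partitioning approach: k-means with k = 2 and several random starts
km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Hierarchical approach: complete-linkage tree cut into two groups
hc <- cutree(hclust(dist(toy_scaled)), k = 2)

# Cross-tabulate the two assignments
print(table(kmeans = km$cluster, hclust = hc))
```

With real data the number of clusters is not known in advance, so the value k = 2 used here is purely an assumption of the toy example.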
Cluster analysis, first of all, solves the problem of adding structure to the data and also selects groups of objects; that is, it looks for a division of the population into areas where objects accumulate. Cluster analysis makes it possible to consider fairly large volumes of data and to shorten and compress them sharply, making them compact.

Fig. 26. Graphic representation of cluster analysis

Because we use the RStudio environment and the R language to perform this work and build clusters, it is not necessary to form an "object-property" table from the provided data, to form the adjacent "original table" and "table-copy", to build a proximity matrix and the like. We can immediately perform the cluster analysis procedure.

Performing a cluster analysis procedure using built-in R methods. Let's select the parameters ph, Solids and Hardness and build a graphical representation of the clustering. The factoextra library provides easy-to-use functions to extract and visualize the results of multivariate data analysis.

    library(ggplot2)
    library(factoextra)
    library(rEMM)
    ggplot(water, aes(ph, Solids, col = Hardness)) + geom_point()

Let's build the clustering matrix:

    set.seed(55)
    cluster <- kmeans(cbind(water$ph, water$Solids), 3, nstart = 10)
    cluster
    table(cluster$cluster, water$Hardness)

Let's build a dendrogram:

    data <- cbind(water$ph, water$Solids)
    data.hclust <- hclust(dist(scale(data, center = apply(data, 2, mean), scale = apply(data, 2, sd))))
    plot(data.hclust)

We chose the parameters Solids, Hardness and built a graphical representation of the clustering:

6.
Conclusions

The work establishes the main trends in determining the suitability of water for human consumption: the most common value of the acid-base balance of water is from 6 to 7; most of the samples in our data set are not suitable for drinking; the most common value of the sulfate balance of water is from 300 to 350; and the most common values of the carbon balance of water are within 12-15. The average and most common value of the acid-alkaline balance of water is 7; the standard deviation of this parameter is insignificant, the values vary in the range of 0-14, and the acid-alkaline balance of water is a fairly stable indicator. In this work, we constructed graphs in Cartesian and polar coordinate systems, derived quantitative characteristics of descriptive statistics, and built histograms and cumulative curves. To investigate this problem, we used the main methods of visualization, graphic representation and primary statistical processing of numerical data. Methods of correlation analysis of experimental data presented as time series were also used in the work.

The most common indicator values determined from the histograms:
• The most common value of the acid-alkaline balance of water is from 6 to 7;
• Most of the samples in our data set are non-potable water;
• The most common value of the sulfate balance of water is from 300 to 350;
• The most common values of the carbon balance of water are in the range of 12-15.

As can be seen from the histogram in Fig. 33, most of the studied water from our dataset is unsuitable for consumption (more than 1200 records).
The descriptive statistics of the acidity level are as follows:
• Mean is 7.08599;
• Skewness (asymmetry) is 0.04891027;
• Standard error is 0.035;
• Range is 13.7725;
• Median is 7.027297;
• Minimum is 0.23;
• Mode is 8.316766;
• Maximum is 14;
• Standard deviation is 1.573337;
• Sum is 14249.93;
• Sample variance is 2.474157;
• Sample size (count) is 2011;
• Kurtosis is 0.6185764;
• Coefficient of variation is 22.2%.

After computing these statistics for the water acidity level, we saw that most values lie roughly between 5 and 9. The level of acidity should be in the range of 6.5 - 8.5. The average value of about 7 is within these limits, and the standard error is relatively small. The median also falls within these limits. The minimum of 0.23 is completely abnormal and can almost be equated to car battery acid, and the maximum of 14 can be equated to soapy water. The difference between the maximum and the minimum is the range, which in our case is 13.7725.

Consider the kurtosis. For a normal distribution, the excess kurtosis is zero. If the kurtosis of a distribution differs from zero, then its density curve differs from the density curve of the normal distribution. Since our kurtosis is positive, the theoretical curve has a higher and "sharper" peak than the normal curve. If it were negative, the curve would have a lower and flatter peak than the normal curve.

The coefficient of variation is also informative: it characterizes the spread of the numerical values of the population units around the average value of the population. The smaller the variation, the more homogeneous the population and the more reliable (typical) the average value. If the coefficient of variation is lower than 33%, the data set is considered quantitatively homogeneous, which corresponds to our result of 22.2%.
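The relations between these quantities can be verified with a few lines of R. The first part of the sketch below uses only the summary values reported above; the second part defines moment-based skewness and excess kurtosis functions, which are an assumed estimator (the exact formula or package used for the reported values is not stated) and are applied to a hypothetical normal sample z.

```r
# Consistency checks using only the reported summary values
n   <- 2011        # sample size
avg <- 7.08599     # mean
s   <- 1.573337    # standard deviation

std_error <- s / sqrt(n)    # standard error of the mean, about 0.035
cv        <- 100 * s / avg  # coefficient of variation in percent, about 22.2
total     <- avg * n        # sum of all observations

print(round(c(std_error = std_error, cv = cv, total = total), 3))

# Moment-based estimators (assumed formulas, for illustration only)
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_excess <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3  # zero for a normal distribution
}

# Hypothetical normal sample: both values should be close to zero
set.seed(42)
z <- rnorm(5000, mean = 7, sd = 1.5)
print(c(skewness = skewness_m(z), excess_kurtosis = kurtosis_excess(z)))
```

The reported standard error, coefficient of variation and sum are reproduced from the mean, standard deviation and sample size, which indicates that the listed statistics are mutually consistent.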
Based on our results, the following can also be stated:
• The average and most common value of the acid-alkaline balance of water is 7;
• The standard deviation of this parameter is insignificant;
• The values range from 0.23 to 14;
• The acid-alkaline balance of water is a fairly stable indicator.

7. References

[1] O. Kuzmin, M. Bublyk, A. Shakhno, O. Korolenko, H. Lashkun, Innovative development of human capital in the conditions of globalization, E3S Web of Conferences 166 (2020) 13011. [2] O. Ilyash, O. Yildirim, D. Doroshkevych, L. Smoliar, T. Vasyltsiv, R. Lupak, Evaluation of enterprise investment attractiveness under circumstances of economic development, Bulletin of Geography. Socio-economic Series 47 (2020) 95-113. http://doi.org/10.2478/bog-2020-0006. [3] I. Jonek-Kowalska, Housing Infrastructure as a Determinant of Quality of Life in Selected Polish Smart Cities, Smart Cities 5(3) (2022) 924–946. [4] O. Maslak, V. Danylko, M. Skliar, Automation and Digitalization of Quality Cost Management of Power Engineering Enterprises, in: Proceedings of the 25th IEEE International Conference on Problems of Automated Electric Drive. Theory and Practice, PAEP 2020. https://doi.org/10.1109/MEES52427.2021.9598744 [5] M. Bublyk, A. Kowalska-Styczen, V. Lytvyn, V. Vysotska, The Ukrainian Economy Transformation into the Circular Based on Fuzzy-Logic Cluster Analysis, Energies 14 (2021) 5951. doi: https://doi.org/10.3390/en14185951. [6] R. Yurynets, Z. Yurynets, O. Budіakova, L. Gnylianska, M. Kokhan, Innovation and Investment Factors in the State Strategic Management of Social and Economic Development of the Country. Modeling and Forecasting, CEUR Workshop Proceedings Vol-2917 (2021) 357-372. [7] Y. Matseliukh, V. Vysotska, M. Bublyk, T. Kopach, O. Korolenko, Network modelling of resource consumption intensities in human capital management in digital business enterprises by the critical path method, CEUR Workshop Proceedings Vol-2851 (2021) 366–380. [8] I.
Jonek-Kowalska, Towards the Reduction of CO2 Emissions. Paths of Pro-Ecological Transformation of Energy Mixes in European Countries with an Above-Average Share of Coal in Energy Consumption. Resources Policy 77 (2022). doi: 10.1016/j.resourpol.2022.102701. [9] M. Bublyk, V. Vysotska, Y. Matseliukh, V. Mayik, M. Nashkerska, Assessing losses of human capital due to man-made pollution caused by emergencies, CEUR Workshop Proceedings Vol-2805 (2020) 74-86. [10] M. Bublyk, Y. Matseliukh, Small-batteries utilization analysis based on mathematical statistics methods in challenges of circular economy, CEUR Workshop Proceedings Vol-2870 (2021) 1594-1603. [11] I. Jonek-Kowalska, R. Wolniak, Economic opportunities for creating smart cities in Poland. Does wealth matter?, Cities 114 (2021) 103222. [12] I. Rishnyak, O. Veres, V. Lytvyn, M. Bublyk, I. Karpov, V. Vysotska, V. Panasyuk, Implementation models application for IT project risk management, CEUR Workshop Proceedings Vol-2805 (2020) 102-117. [13] O. Kuzmin, M. Bublyk, Economic evaluation and government regulation of technogenic (Man-Made) damage in the national economy, in: International Scientific and Technical Conference on Computer Sciences and Information Technologies, 2016, pp. 37–39. [14] R. Wolniak, I. Jonek-Kowalska, The level of the quality of life in the city and its monitoring, Innovation: The European Journal of Social Science Research 34(3) (2021) 376–398. [15] T. Vasyltsiv, I. Irtyshcheva, R. Lupak, N. Popadynets, Y. Shyshkova, Y. Boiko, O. Ishchenko, Economy's innovative technological competitiveness: Decomposition, methodic of analysis and priorities of public policy, Management Science Letters 10(13) (2020) 3173-3182. https://doi.org/10.5267/j.msl.2020.5.004. [16] M. S. I. Khan, N. Islam, J. Uddin, S. Islam, M. K.
Nasir, Water quality prediction and classification based on principal component regression and gradient boosting classifier approach, Journal of King Saud University-Computer and Information Sciences 34(8) (2022) 4773-4781. https://doi.org/10.1016/j.jksuci.2021.06.003. [17] T. H. H Aldhyani, M. Al-Yaari, H. Alkahtani, M. Maashi, Water Quality Prediction Using Artificial Intelligence Algorithms, Applied Bionics and Biomechanics 2020, Article ID 6659314, 12 p., 2020. https://doi.org/10.1155/2020/6659314. [18] S. Chatterjee, S. Sarkar, N. Dey, S. Sen, T. Goto, N. C. Debnath, Water quality prediction: Multi objective genetic algorithm coupled artificial neural network based approach, in: Int. Conf. on Industrial Informatics, 2017, pp. 963-968. https://ieeexplore.ieee.org/document/8104902. [19] Water quality describes the condition of the water, including chemical, physical, and biological characteristics, usually with respect to its suitability for a particular purpose such as drinking or swimming. URL: https://floridakeys.noaa.gov/ocean/waterquality.html. [20] Importance of Water Quality and Testing. URL: https://www.cdc.gov/healthywater/drinking/public/water_quality.html. [21] A. N. Ahmed, et al. Machine learning methods for better water quality prediction, Journal of Hydrology 578 (2019) 124084. [22] Y. Chen, L. Song, Y. Liu, L. Yang, D. Li, A review of the artificial neural network models for water quality prediction. Applied Sciences 10(17) (2020) 5776. [23] A. Gozhyj, I. Kalinina, V. Gozhyj, Fuzzy cognitive analysis and modeling of water quality, in: International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), 2017, pp. 289-294. [24] M. Linan, B. Gerardo, R. Medina, Self-Organizing Map with Nguyen-Widrow Initialization Algorithm for Groundwater Vulnerability Assessment, International Journal of Computing 19(1) (2020) 63-69. [25] D.K. Mozgovoy, V.V. Hnatushenko, V.V. 
Vasyliev, Automated recognition of vegetation and water bodies on the territory of megacities in satellite images of visible and IR bands, ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. IV-3 (2018) 167–172, https://doi.org/10.5194/isprs-annals-IV-3-167-2018. [26] W. Wójcik, et al., Hydroecological investigations of water objects located on urban areas, in: Environmental Engineering V – Proceedings of the 5th National Congress of Environmental Engineering, 2017, pp. 155–160. [27] R.Ya. Kosarevich, et al., Assessment of damages caused by thermal fatigue cracks in water economizer collector, Fiziko-Khimicheskaya Mekhanika Materialov 40(1) (2004) 109–115. [28] O. Alokhina, et al., Solar Activity and Water Content of Closed Lake Ecosystems, in: General Assembly and Scientific Symposium of the International Union of Radio Science, 2020, 9232274. [29] N. Anufrieva, Y. Obukh, B. Rusyn, I. Fartushok, Expert computer system for technical diagnostics of the efficiency of main constitutive elements of the water steam route, in: The Experience of Designing and Application of CAD Systems in Microelectronics - Proceedings of the 9th International Conference, CADSM, 2007, pp. 206. [30] N. Anufrieva, Y. Obukh, B. Rusyn, I. Fartushok, Typical damage image database of the main constitutive elements of the water steam route, in: The Experience of Designing and Application of CAD Systems in Microelectronics - Proceedings of the International Conference, 2007, pp. 518. [31] R. Elmahdi, Predicting Water Quality Variables. URL: https://scholar.sun.ac.za/bitstream/handle/10019.1/108072/elmahdi_predicting_2020.pdf?sequence=2&isAllowed=y. [32] V. Sagan, K. T. Peterson, M. Maimaitijiang, P. Sidike, J. Sloan, B. A Greeling, S. Maalouf, C. Adams, Monitoring inland water quality using remote sensing: potential and limitations of spectral indices, bio-optical simulations, machine learning, and cloud computing. URL: https://www.sciencedirect.com/science/article/abs/pii/S0012825220302336.
[33] T. S. Kapalanga, Z. Hoko, W. Gumindoga, L. Chikwiramakomo, Remote-sensing-based algorithms for water quality monitoring in Olushandja Dam, north-central Namibia, Water Supply 21(5) (2021) 1878-1894. [34] Y. F. Zhang, P. J. Thorburn, M. P. Vilas, P. Fitch, Machine learning approaches to improve and predict water quality data, in: International Congress on Modelling and Simulation-Supporting Evidence-Based Decision Making: the Role of Modelling and Simulation, MODSIM 2019. [35] J. O., Oladipo, A. S., Akinwumiju, O. S., Aboyeji, A. A. Adelodun, Comparison between fuzzy logic and water quality index methods: A case of water quality assessment in Ikare community, Southwestern Nigeria, Environmental Challenges 3 (2021) 100038. [36] O. S. Aboyeji, S. F. Eigbokhan, Evaluations of groundwater contamination by leachates around Olusosun open dumpsite in Lagos metropolis, southwest Nigeria, Journal of environmental management 183 (2016) 333-341. [37] M. Yilma, Z. Kiflie, A. Windsperger, N. Gessese, Application of artificial neural network in water quality index prediction: a case study in Little Akaki River, Addis Ababa, Ethiopia, Modeling Earth Systems and Environment 4(1) (2018) 175-187. [38] D. M. Bushero, Z. A. Angello, B. M. Behailu, Evaluation of hydrochemistry and identification of pollution hotspots of little Akaki river using integrated water quality index and GIS, Environmental Challenges 8 (2022) 100587. [39] M. F. M. Nasir, M. S. Samsudin, I. Mohamad, M. R. A. Awaluddin, M. A. Mansor, H. Juahir, N. Ramli, River water quality modeling using combined principle component analysis (PCA) and multiple linear regressions (MLR): a case study at Klang River, Malaysia, World Applied Sciences Journal 14 (2011) 73-82. [40] M. Hameed, S. S. Sharqi, Z. M. Yaseen, H. A. Afan, A. Hussain, A. 
Elshafie, Application of artificial intelligence (AI) techniques in water quality index prediction: a case study in tropical region, Malaysia, Neural Computing and Applications 28(1) (2017) 893-905. [41] J. Y. Ho, et. al., Towards a time and cost effective approach to water quality index class prediction, Journal of Hydrology 575 (2019) 148-165. [42] R. Barzegar, M. T. Aalami, J. Adamowski, Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model, Stochastic Environmental Research and Risk Assessment 34(2) (2020) 415-433. [43] Z. Li, F. Peng, B. Niu, G. Li, J. Wu, Z. Miao, Water quality prediction model combining sparse auto-encoder and LSTM network, in: IFAC-PapersOnLine 51(17) (2018) 831-836. [44] S. B. H. S. Asadollah, A. Sharafati, D. Motta, Z. M. Yaseen, River water quality index prediction and uncertainty analysis: A comparative study of machine learning models, Journal of environmental chemical engineering 9(1) (2021) 104599. [45] T. Rajaee, S. Khani, M. Ravansalar, Artificial intelligence-based single and hybrid models for prediction of water quality in rivers: A review, Chemometrics and Intelligent Laboratory Systems 200 (2020) 103978. [46] M. S. Samsudin, A. Azid, S. I. Khalit, M. S. A. Sani, F. Lananan, Comparison of prediction model using spatial discriminant analysis for marine water quality index in mangrove estuarine zones, Marine pollution bulletin 141 (2019) 472-481. [47] M. Imani, M. M. Hasan, L. F. Bittencourt, K. McClymont, Z. Kapelan, A novel machine learning application: Water quality resilience prediction Model, Science of the Total Environment 768 (2021) 144459. [48] M. Ranjithkumar, L. Robert, Machine Learning Techniques and Cloud Computing to Estimate River Water Quality-Survey, Inventive communication and computational technologies, Springer, Singapore, 2021, p. 387-396. [49] Y. Trach, R. Trach, M. Kalenik, E. Koda, A. 
Podlasek, A Study of Dispersed, Thermally Activated Limestone from Ukraine for the Safe Liming of Water Using ANN Models, Energies 14(24) (2021) 8377. [50] Y. Trach, D. Chernyshev, O. Biedunkova, V. Moshynskyi, R. Trach, I. Statnyk, Modeling of Water Quality in West Ukrainian Rivers Based on Fluctuating Asymmetry of the Fish Population, Water 14(21) (2022) 3511. [51] L. V. Hryhorenko, Drinking water quality influence to the peasants’ morbidity in the Ukrainian settlements, International Journal of Statistical Distributions and Applications 3(3) (2017) 38-46. [52] J. Ober, J. Karwot, S. Rusakov, Tap Water Quality and Habits of Its Use: A Comparative Analysis in Poland and Ukraine, Energies 15(3) (2022) 981. [53] B. Polishchuk, A. Berko, L. Chyrun, M. Bublyk, V. Schuchmann, The Rain Prediction in Australia Based Big Data Analysis and Machine Learning Technology, in: International Scientific and Technical Conference on Computer Sciences and Information Technologies, 2021, pp. 97–100. [54] D. Koshtura, M. Bublyk, Y. Matseliukh, D. Dosyn, L. Chyrun, O. Lozynska, I. Karpov, I. Peleshchak, M. Maslak, O. Sachenko, Analysis of the demand for bicycle use in a smart city based on machine learning, CEUR workshop proceedings Vol-2631 (2020) 172-183. [55] A. Katrenko, I. Krislata, O. Veres, O. Oborska, T. Basyuk, A. Vasyliuk, I. Rishnyak, N. Demyanovskyi, O. Meh, Development of traffic flows and smart parking system for smart city. CEUR Workshop Proceedings Vol-2604 (2020) 730–745. [56] V. V. Lytvyn, M. I. Bublyk, V. A. Vysotska, Y. R. Matseliukh, Technology of visual simulation of passenger flows in the field of public transport Smart City, Radioelectronics, informatics, management, No. 4, 2021. [57] L. Podlesna, M. Bublyk, I. Grybyk, Y. Matseliukh, Y. Burov, P. Kravets, O. Lozynska, I. Karpov, I. Peleshchak, R. Peleshchak, Optimization model of the buses number on the route based on queueing theory in a Smart City, CEUR Workshop Proceedings Vol-2631 (2020) 502-515. [58] Y. 
Matseliukh, M. Bublyk, V. Vysotska, Development of intelligent system for visual passenger flows simulation of public transport in smart city based on neural network, CEUR Workshop Proceedings, Vol-2870 (2021). [59] V. Husak, L. Chyrun, Y. Matseliukh, A. Gozhyj, R. Nanivskyi, M. Luchko, Intelligent Real-Time Vehicle Tracking Information System, CEUR Workshop Proceedings 2917 (2021) 666-698. [60] V. Lytvyn, A. Hryhorovych, V. Hryhorovych, L. Chyrun, V. Vysotska, M. Bublyk, Medical Content Processing in Intelligent System of District Therapist, CEUR Workshop Proceedings Vol-2753 (2020) 415-429. [61] C. M. Fedorov, A. Berko, Y. Matseliukh, V. Schuchmann, I. Budz, O. Garbich-Moshora, M. Mamchyn, Decision support system for formation and implementing orders based on cross programming and cloud computing, CEUR Workshop Proceedings Vol-2917 (2021) 714–748. [62] M. Bublyk, V. Mykhailov, Y. Matseliukh, T. Pihniak, A. Selskyi, I. Grybyk, Change management in R&D-quality costs in challenges of the global economy, CEUR Workshop Proceedings Vol-2870 (2021) 1139–1151. [63] V. Vysotska, A. Berko, M. Bublyk, L. Chyrun, A. Vysotsky, K. Doroshkevych, Methods and tools for web resources processing in e-commercial content systems, in: Int. Scientific and Technical Conference on Computer Sciences and Information Technologies, 2020, pp. 114-118. [64] A. Berko, I. Pelekh, L. Chyrun, M. Bublyk, I. Bobyk, Y. Matseliukh, L. Chyrun, Application of ontologies and meta-models for dynamic integration of weakly structured data, in: Proceedings of International Conference on Data Stream Mining and Processing, DSMP, 2020, pp. 432-437. [65] M. Bublyk, V. Lytvyn, V. Vysotska, L. Chyrun, Y. Matseliukh, N. Sokulska, The Decision Tree Usage for the Results Analysis of the Psychophysiological Testing, CEUR Workshop Proceedings Vol-2753 (2020) 458-472. [66] A. Agresti, Analysis of Ordinal Categorical Data, John Wiley & Sons, 1984. [67] S.
Glen, Kendall's Tau (Kendall Rank Correlation Coefficient), Elementary Statistics for the rest of us, 2022. URL: https://www.statisticshowto.com/kendalls-tau/. [68] Construction of an interval variable sequence of continuous quantitative data, 2022. URL: https://stud.com.ua/93314/statistika/pobudova_intervalnogo_variatsiynogo_ryadu_bezperernih_ki lkisnih_danih. [69] M. Bublyk, V. Feshchyn, L. Bekirova, O. Khomuliak, Sustainable Development by a Statistical Analysis of Country Rankings by the Population Happiness Level, CEUR Workshop Proceedings 3171 (2022) 817–837. [70] Forecasting the trend of the time series by algorithmic methods, 2022. URL: http://ubooks.com.ua/books/000269/inx42.php. [71] M. Bublyk, I. Klymus, B. Tsoniev, V. Zatkhei, Comparative Analysis of The Caloric Performance of Products for People with Cardiovascular Disease, CEUR Workshop Proceedings 3171(2022) 838–857. [72] Statistical models of marketing decisions taking into account the uncertainty factor, 2022. URL: https://excel2.ru/articles/uroven-znachimosti-i-uroven-nadezhnosti-v-ms-excel. [73] F. X. Diebold, Econometrics. Streamlined, Applied and e-Aware, Mc Graw Hill, Boston, 2013. [74] N. Vlasova, M. Bublyk, Intelligent Analysis Impact of the COVID-19 Pandemic on Juvenile Drug Use and Proliferation, CEUR Workshop Proceedings 3171 (2022) 858–876. [75] M. J. Schervish, Theory of Statistics, Springer Science & Business Media, New York, 2012. [76] Grouping of statistical data - BukLib.net Library, 2022. URL: https://buklib.net/books/35946/ [77] O. Prokipchuk, L. Chyrun, M. Bublyk, V. Panasyuk, V. Yakimtsov, R. Kovalchuk, Intelligent system for checking the authenticity of goods based on blockchain technology, CEUR Workshop Proceedings Vol-2917 (2021) 618-665. [78] C. Baum, An Introduction to Modern Econometrics Using Stata, Mc Graw Hill, Boston, 2020. [79] Standard error, 2022. URL: https://ua.nesrakonk.ru/standard-error/. [80] Standard deviation, 2022. 
URL: https://studopedia.su/10_11382_standartne-vidhilennya.html. [81] K.O. Soroka, Fundamentals of Systems Theory and Systems Analysis, Kharkiv, 2004. [82] A. Kowalska-Styczen, K. Sznajd-Weron, From consumer decision to market share - unanimity of majority? JASSS, 19(4) (2016). DOI:10.18564/jasss.3156. [83] I.V. Stetsenko, Systems modeling, Cherkasy, 2010. [84] S.S. Velykodnyi, Modeling of systems, Odessa, 2018. [85] Graphic presentation of information, 2022. URL: https://studopedia.com.ua/1_132145_grafichne-podannya-informatsii.html. [86] Y. Yusyn, T. Zabolotnia, Methods of Acceleration of Term Correlation Matrix Calculation in the Island Text Clustering Method, CEUR Workshop Proceedings Vol-2604 (2020) 140-150. [87] N. Romanyshyn, Algorithm for Disclosing Artistic Concepts in the Correlation of Explicitness and Implicitness of Their Textual Manifestation, CEUR Workshop Proceedings Vol-2870 (2021) 719-730. [88] B. Rusyn, V. Ostap, O. Ostap, A correlation method for fingerprint image recognition using spectral features, in: Proceedings of the International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science, TCSET, 2002, pp. 219–220. [89] Dataset. URL: https://www.kaggle.com/adityakadiwal/water-potability. [90] Drinking Water Analysis Solutions. URL: https://resources.perkinelmer.com/lab-solutions/resources/docs/BRO_Drinking_Water_Analysis_Solutions_Brochure.pdf. [91] S. Babichev, B. Durnyak, I. Pikh, V. Senkivskyy, An Evaluation of the Objective Clustering Inductive Technology Effectiveness Implemented Using Density-Based and Agglomerative Hierarchical Clustering Algorithms, Lecture Notes in Computational Intelligence and Decision Making 1020 (2020) 532-553. [92] S. Babichev, V. Lytvynenko, V. Osypenko, Implementation of the objective clustering inductive technology based on DBSCAN clustering algorithm, in: Proceedings of Int. Scientific and Technical Conf. on Computer Sciences and Information Technologies, 2017, pp. 479-484. [93] O.
Veres, Y. Matseliukh, T. Batiuk, S. Teslia, A. Shakhno, T. Kopach, Y. Romanova, I. Pihulechko, Cluster Analysis of Exclamations and Comments on E-Commerce Products, CEUR Workshop Proceedings Vol-3171 (2022) 1403-1431. [94] S. Babichev, M.A. Taif, V. Lytvynenko, V. Osypenko, Criterial analysis of gene expression sequences to create the objective clustering inductive technology, in: Proceedings of Int. Conf. on Electronics and Nanotechnology, 2017, pp. 244–248. doi: 10.1109/ELNANO.2017.7939756. [95] A. Kowalska-Styczen, K. Sznajd-Weron, Access to information in word of mouth marketing within a cellular automata model. Advances in Complex Systems, 15(8) (2012). DOI:10.1142/S0219525912500804. [96] S. A. Babichev, A. Gozhyj, A. I. Kornelyuk, V. I. Lytvynenko, Objective clustering inductive technology of gene expression profiles based on SOTA clustering algorithm, Biopolymers and Cell 33(5) (2017) 379–392. doi: 10.7124/bc.000961. [97] I. Lurie, V. Lytvynenko, S. Olszewski, M. Voronenko, A. Kornelyuk, U. Zhunissova, О. Boskin, The Use of Inductive Methods to Identify Subtypes of Glioblastomas in Gene Clustering, CEUR Workshop Proceedings Vol-2631 (2020) 406-418. [98] V. Lytvynenko, et. al., Two step density-based object-inductive clustering algorithm, CEUR Workshop Proceedings 2386 (2019) 117–135. [99] I. Lurie, et. al., Inductive technology of the target clusterization of enterprise's economic indicators of Ukraine, CEUR Workshop Proceedings 2353 (2019) 848–859.