Online Stacking Credibilistic Fuzzy Clustering for Data Stream Mining

Yevgeniy Bodyanskiy1, Alina Shafronenko1, Diana Rudenko1 and Oleksii Tanianskyi1

1 Kharkiv National University of Radio Electronics, Nauky ave 14, Kharkiv, 61166, Ukraine

Abstract
An important problem that arises when processing large amounts of observations is data compression: highlighting the most essential information and identifying certain latent factors that implicitly determine the nature of the phenomenon being studied. One of the most effective approaches to solving this problem is the apparatus of factor analysis, which has found wide application in problems of processing empirical data in various fields. Although fuzzy clustering is a popular approach to soft data partitioning, its use encounters difficulties when processing high-dimensional real data with complex hidden distributions. This paper proposes a stacked fuzzy clustering method in which the data are represented in a new feature space created by a stacking neural network. This approach aims to overcome the challenges associated with processing complex data and can bring significant improvements in the quality of clustering.

Keywords
Stack learning, data compression, fuzzy clustering, credibilistic fuzzy clustering, data stream mining

1. Introduction

Clustering is a technique in machine learning and data analysis that involves grouping a set of data points into subsets, or clusters, based on the similarity between them. Fuzzy clustering is a variation of traditional clustering methods that allows for more flexible and nuanced assignments of data points to clusters [1-5]. Whereas traditional (crisp) clustering assigns each point to exactly one cluster, fuzzy clustering allows data points to belong to multiple clusters simultaneously with varying degrees of membership. This reflects the inherent uncertainty or ambiguity present in real-world data. The Fuzzy C-Means (FCM) algorithm, popularized by James Bezdek, is a prominent method in fuzzy clustering [6].
FCM assigns membership degrees to data points, indicating the likelihood of each point belonging to different clusters. This flexibility makes fuzzy clustering particularly useful in scenarios where data points may exhibit overlapping characteristics or uncertainty in their categorization. Over the years, fuzzy clustering has found applications in diverse fields, including pattern recognition, image processing, and Data Mining. Researchers have developed various extensions and enhancements to the original FCM algorithm, addressing specific challenges and improving its adaptability to different data patterns. The validity of fuzzy clustering solutions became a key focus, leading to the introduction of indices to assess the quality of clustering results. These indices help researchers and practitioners evaluate the effectiveness of fuzzy clustering algorithms in capturing meaningful patterns within datasets. The evolution of fuzzy clustering has seen ongoing advancements, with researchers exploring sophisticated membership functions and integrating fuzzy clustering with other machine learning techniques. This integration has expanded the capabilities of fuzzy clustering, making it applicable to complex problems in large-scale data analysis.

CMIS-2024: Seventh International Workshop on Computer Modeling and Intelligent Systems, May 3, 2024, Zaporizhzhia, Ukraine
yevgeniy.bodyanskiy@nure.ua (Ye. Bodyanskiy); alina.shafronenko@nure.ua (A. Shafronenko); diana.rudenko@nure.ua (D. Rudenko); oleksii.tanianskyi@nure.ua (O. Tanianskyi)
ORCID: 0000-0001-5418-2143 (Ye. Bodyanskiy); 0000-0002-8040-0279 (A. Shafronenko); 0000-0002-1792-5080 (D. Rudenko); 0009-0005-3491-4470 (O. Tanianskyi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

The era of big data has significantly influenced the field of clustering, including traditional clustering methods and the development of fuzzy clustering techniques. Deep learning on big data represents a powerful combination that has transformed various fields by enabling more sophisticated analysis, pattern recognition, and decision-making capabilities. Recently, there has been significant research into leveraging deep learning to uncover meaningful data representations through neural networks. A notable area of exploration involves the integration of unsupervised clustering algorithms with stack neural networks. This synergy has become a vibrant field of research, aiming to jointly optimize the performance of deep learning models and clustering algorithms.

The goal of this work is to propose a stack neuro-fuzzy system for Data Stream Mining based on the credibilistic approach and designed to work both in batch mode and in a recurrent online version.

2. Neural Network Data Compression

An important problem that arises when processing large amounts of observations is data compression: highlighting the most essential information and identifying certain latent factors that implicitly determine the nature of the phenomenon being studied. One of the most effective approaches to solving this problem is the apparatus of factor analysis [7], which has found wide application in problems of processing empirical data in various fields: psychology, sociology, technology, economics, medicine, criminology, etc.
The basic idea of factor analysis, which allows for the presence of a priori unknown hidden factors, leads to the following informal task: by observing a large number of measured parameters (indicators), identify a small number of factor-parameters that mainly determine the behavior of the measured parameters; or, in other words, knowing the values of a large number of measured parameter-functions, determine the corresponding values of the factor-arguments common to all functions and restore the form of these functions.

The initial information for factor analysis is the (N × n) observation matrix

X(N) =
\begin{pmatrix}
x_1(1) & x_2(1) & \cdots & x_n(1) \\
x_1(2) & x_2(2) & \cdots & x_n(2) \\
\vdots & \vdots & & \vdots \\
x_1(N) & x_2(N) & \cdots & x_n(N)
\end{pmatrix}
=
\begin{pmatrix}
x^T(1) \\ x^T(2) \\ \vdots \\ x^T(N)
\end{pmatrix},    (1)

formed by an array of N n-dimensional vectors x(k) = (x_1(k), x_2(k), \ldots, x_n(k))^T, and the (n × n) autocorrelation matrix

R(N) = \frac{1}{N} \sum_{k=1}^{N} \left( x(k) - \bar{x}(N) \right) \left( x(k) - \bar{x}(N) \right)^T = \frac{1}{N} \sum_{k=1}^{N} \tilde{x}(k) \tilde{x}^T(k),    (2)

where

\bar{x}(N) = \frac{1}{N} \sum_{k=1}^{N} x(k), \qquad \tilde{x}(k) = x(k) - \bar{x}(N)    (3)

are the sample mean and the vectors of measured indicators centered relative to the average of the data array.

One of the most common and effective methods for finding factors is the principal component method (component analysis), which is widely used in problems of data compression, pattern recognition, coding, image processing, spectral analysis, etc., and is known in pattern recognition theory as the Karhunen-Loève transform. The task of component analysis is to project the data vectors from the original n-dimensional space into an m-dimensional (m < n) space of principal components; it reduces to searching for a system w_1, w_2, \ldots, w_m of orthonormal eigenvectors of the matrix R(N) such that w_1 = (w_{11}, w_{12}, \ldots, w_{1n})^T corresponds to the largest eigenvalue \lambda_1 of R(N), w_2 to the second largest eigenvalue \lambda_2, etc.
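As an illustration, the centering (3) and the autocorrelation matrix (2) can be sketched in a few lines, assuming the observations are stored row-wise in an (N × n) NumPy array; the function name is ours, not part of the paper:

```python
import numpy as np

def correlation_matrix(X):
    """Compute the (n x n) matrix R(N) of eqs. (2)-(3) from an (N x n) data
    matrix X: center the observations on the sample mean, then average the
    outer products of the centered vectors."""
    x_mean = X.mean(axis=0)            # sample mean, eq. (3)
    X_c = X - x_mean                   # centered observations x~(k)
    return (X_c.T @ X_c) / X.shape[0]  # R(N) = (1/N) sum x~(k) x~^T(k), eq. (2)
```

The result coincides with the biased (divide-by-N) sample covariance matrix of the data.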
In other words, the problem comes down to finding solutions of the matrix equation

\left( R(N) - \lambda_j I_n \right) w_j = 0

such that \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m \geq \varepsilon \geq 0 and \| w_j \| = 1. The dimension m of the space of principal components is determined, as a rule, from empirical considerations and the required degree of compression of the data array.

Thus, in algebraic terms, solving the factor problem is closely related to the eigenvalue problem and to finding the rank of the correlation matrix; in a geometric sense, it is the problem of moving to a lower-dimensional space with minimal loss of information; in a statistical sense, it is the problem of finding a set of orthonormal vectors in the input space that "accept" the maximum possible variation of the data; and, finally, in an algorithmic sense, it is the problem of sequentially determining the set of eigenvectors w_1, w_2, \ldots, w_m by optimizing a set of local criteria that form the global objective function

E^k = \sum_{j=1}^{m} E_j^k = \frac{1}{k} \sum_{j=1}^{m} \sum_{p=1}^{k} \left( w_j^T \tilde{x}(p) \right)^2

with constraints w_j^T w_l = 0 for j \neq l and w_j^T w_j = 1.

The first principal component w_1 can be found by maximizing the criterion

E_1^k = \frac{1}{k} \sum_{p=1}^{k} \left( w_1^T \tilde{x}(p) \right)^2    (4)

by solving a nonlinear programming problem using indefinite Lagrange multipliers. However, if data processing must be carried out in real time, neural network technologies come to the fore, among which E. Oja's self-learning rule and neuron should be noted. It is with the help of Oja's rule in the form

\begin{cases}
w_1(k+1) = w_1(k) + \eta(k) y_1(k) \left( \tilde{x}(k) - w_1(k) y_1(k) \right), \\
y_1(k) = w_1^T(k) \tilde{x}(k), \quad w_1(0) \neq 0
\end{cases}    (5)

that the first principal component can be isolated. Next, following the procedure of standard principal component analysis, from each vector \tilde{x}(k), k = 1, 2, \ldots, N, its projection onto the first principal component is subtracted, and the first principal component of the differences is calculated; it is the second principal component of the original data and is orthonormal to the first.
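Oja's rule (5) can be sketched as a short online loop; the fixed learning rate, the epoch count, and the function name below are illustrative choices of ours, not values prescribed by the paper:

```python
import numpy as np

def oja_first_pc(X, eta=0.002, epochs=100):
    """Extract the first principal component with Oja's rule (5).
    X is an (N x n) array of already-centered observations x~(k)."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1])  # w1(0) != 0
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                      # y1(k) = w1^T(k) x~(k)
            w = w + eta * y * (x - w * y)  # rule (5)
    return w
```

At convergence w is (up to sign) the unit eigenvector of the sample correlation matrix belonging to its largest eigenvalue.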
The third principal component is calculated by projecting each original vector \tilde{x}(k) onto the first two components, subtracting this projection from \tilde{x}(k), and finding the first principal component of the differences, which is the third principal component of the original data array. The remaining principal components are calculated recursively according to the described strategy. It is this idea of recursive calculation of the principal components that forms the basis of the algorithm proposed by T. Sanger [8], which in a modified form can be written as [9]

\begin{cases}
w_j(k+1) = w_j(k) + \eta(k) e_j(k) y_j(k), \\
e_j(k) = e_{j-1}(k) - w_j(k) y_j(k), \\
y_j(k) = w_j^T(k) \tilde{x}(k), \quad w_j(0) \neq 0, \\
e_0(k) = \tilde{x}(k), \quad j = 1, 2, \ldots, m, \\
\eta(k) = r^{-1}(k), \quad r(k) = \alpha r(k-1) + \| \tilde{x}(k) \|^2, \quad 0 \leq \alpha \leq 1.
\end{cases}    (6)

It is easy to see that the first principal component is calculated using the Oja algorithm; then the projections of the input vectors onto w_1(k) are subtracted from the inputs and the differences are processed by the next neuron, etc.

Figure 1: T. Sanger's neural network

Fig. 1 shows a diagram of the modified artificial neural network of T. Sanger, composed of E. Oja's neurons and implementing algorithm (6). The first layer of the network is formed by encoder neurons that pre-process the signals by centering and normalizing them. The signals x_1(k), x_2(k), \ldots, x_n(k) are then processed in the second hidden layer, formed by E. Oja's neurons, after which they are sent to the output layer, formed by elements with rectifier activation functions with a dead zone

\psi(u) = \begin{cases} u, & \text{if } u \geq \theta, \\ 0, & \text{otherwise}, \end{cases}    (7)

which allows informative signals y_j(k) to be highlighted and the noise to be filtered out.
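A minimal sketch of the recursive Sanger-type procedure (6) follows; for simplicity it uses a fixed learning rate instead of the adaptive schedule η(k) = r⁻¹(k), and the function name and parameter values are illustrative assumptions of ours:

```python
import numpy as np

def sanger_pcs(X, m, eta=0.002, epochs=200):
    """Extract the first m principal components with the recursive scheme (6):
    each neuron applies an Oja-type update to the residual e_j(k) left after
    subtracting the projections onto the previous components.
    X is an (N x n) array of already-centered observations x~(k)."""
    rng = np.random.default_rng(1)
    W = rng.standard_normal((m, X.shape[1])) * 0.1  # w_j(0) != 0
    for _ in range(epochs):
        for x in X:
            e = x.copy()                     # e_0(k) = x~(k)
            for j in range(m):
                y = W[j] @ x                 # y_j(k) = w_j^T(k) x~(k)
                e = e - W[j] * y             # e_j(k) = e_{j-1}(k) - w_j(k) y_j(k)
                W[j] = W[j] + eta * e * y    # w_j(k+1) = w_j(k) + eta e_j(k) y_j(k)
    return W
```

The first row of W converges to the first principal component (pure Oja rule), the second row to the first principal component of the residuals, i.e. the second principal component of the data, and so on.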
The Sanger neural network is an effective means of compressing information with minimal loss of accuracy, but its capabilities are limited by the fact that, implementing essentially the standard technique of factor analysis, it solves a linear problem, while the main advantage of neural network technologies is the ability to work in essentially nonlinear situations. The problem of nonlinear factor analysis can be effectively solved using credibility theory and cluster analysis.

3. Fuzzy credibilistic clustering

As an alternative to probabilistic and possibilistic procedures [10], a credibilistic fuzzy clustering approach was introduced; it uses credibility theory [11] as its basis and is largely devoid of the drawbacks of the known methods. The most common approach within the framework of probabilistic fuzzy clustering is associated with minimizing the goal function [12-14]

E(u_q(k), c_q) = \sum_{k=1}^{N} \sum_{q=1}^{m} u_q^{\beta}(k) \, d^2(x(k), c_q)    (8)

with constraints

\sum_{q=1}^{m} u_q(k) = 1, \qquad 0 < \sum_{k=1}^{N} u_q(k) < N.    (9)

Solving this nonlinear programming problem using the method of indefinite Lagrange multipliers leads to the well-known result [9, 11, 15]

\begin{cases}
u_q(k) = \dfrac{\left( d^2(x(k), c_q) \right)^{\frac{1}{1-\beta}}}{\sum_{l=1}^{m} \left( d^2(x(k), c_l) \right)^{\frac{1}{1-\beta}}}, \\[2mm]
c_q = \dfrac{\sum_{k=1}^{N} u_q^{\beta}(k) \, x(k)}{\sum_{k=1}^{N} u_q^{\beta}(k)},
\end{cases}    (10)

which for \beta = 2 coincides with the popular Fuzzy C-Means (FCM) method of J. Bezdek [6]. If the data are fed for processing sequentially, the solution of the nonlinear programming problem (8), (9) using the Arrow-Hurwicz-Uzawa algorithm leads to the online procedure

\begin{cases}
u_q(k+1) = \dfrac{\left( d^2(x(k+1), c_q(k)) \right)^{\frac{1}{1-\beta}}}{\sum_{l=1}^{m} \left( d^2(x(k+1), c_l(k)) \right)^{\frac{1}{1-\beta}}}, \\[2mm]
c_q(k+1) = c_q(k) + \eta(k+1) \, u_q^{\beta}(k+1) \left( x(k+1) - c_q(k) \right).
\end{cases}    (11)
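One step of the online procedure (11) might look as follows, assuming squared Euclidean distances and a fixed learning rate; the function name and the small guard against zero distances are our illustrative additions:

```python
import numpy as np

def online_fcm_step(x, C, beta=2.0, eta=0.05):
    """One step of the online fuzzy c-means procedure (11): compute the
    memberships of the new observation x in each of the m clusters, then
    shift every centroid toward x proportionally to its membership.
    C is an (m x n) array of current centroids."""
    d2 = ((C - x) ** 2).sum(axis=1)        # squared distances d^2(x, c_q)
    d2 = np.maximum(d2, 1e-12)             # guard against a zero distance
    p = d2 ** (1.0 / (1.0 - beta))         # (d^2)^(1/(1-beta))
    u = p / p.sum()                        # memberships, first line of (11)
    C = C + eta * (u ** beta)[:, None] * (x - C)  # centroid update, second line
    return u, C
```

The memberships sum to one by construction, and every centroid moves toward the new observation by an amount weighted by its (powered) membership.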
The goal function of credibilistic fuzzy clustering has the form [6, 11], close to (8),

E(Cred_q(k), c_q) = \sum_{k=1}^{N} \sum_{q=1}^{m} Cred_q^{\beta}(k) \, d^2(x(k), c_q)    (12)

with "softer" constraints than (9):

\begin{cases}
0 \leq Cred_q(k) \leq 1 \text{ for all } q \text{ and } k, \\
\sup_q Cred_q(k) \geq 0.5 \text{ for all } k, \\
Cred_q(k) + \sup_{l \neq q} Cred_l(k) = 1 \text{ for any } q \text{ and } k \text{ for which } Cred_q(k) \geq 0.5.
\end{cases}    (13)

It should be noted that the goal functions (8) and (12) are similar, and that (13) contains no rigid probabilistic constraint on the sum of the memberships like the first condition in (9). In credibilistic clustering procedures there is also the concept of fuzzy membership, which is calculated using a neighborhood function of the form

u_q(k) = \varphi_q \left( d(x(k), c_q) \right)    (14)

monotonically decreasing on the interval [0, \infty) so that \varphi_q(0) = 1, \varphi_q(\infty) \to 0. Such a function is essentially an empirical similarity measure [13, 15, 16] related to the distance by

u_q(k) = \frac{1}{1 + d^2(x(k), c_q)}.    (15)

Note also that it was shown earlier in [14] that the first relation in (10) for \beta = 2 can be rewritten as

u_q(k) = \left( 1 + \frac{d^2(x(k), c_q)}{\sigma_q^2} \right)^{-1},    (16)

where

\sigma_q^2 = \left( \sum_{\substack{l=1 \\ l \neq q}}^{m} d^2(x(k), c_l) \right)^{-1},    (17)

which is a generalization of function (15) (for \sigma_q^2 = 1, (16) coincides with (15)) and satisfies all the conditions imposed on (14).
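A sketch of how the memberships (16) with the adaptive spreads (17) can be normalized and turned into credibility measures satisfying the constraints (13), assuming squared Euclidean distances and at least two centroids; the function name and array layout are ours:

```python
import numpy as np

def credibility_measures(x, C):
    """For one observation x and an (m x n) array of centroids C (m >= 2),
    compute memberships via (16)-(17), normalize them by their supremum,
    and form credibility measures consistent with constraints (13)."""
    d2 = ((C - x) ** 2).sum(axis=1)     # squared distances d^2(x, c_q)
    sigma2 = 1.0 / (d2.sum() - d2)      # adaptive spreads, eq. (17)
    u = 1.0 / (1.0 + d2 / sigma2)       # memberships, eq. (16)
    u_star = u / u.max()                # normalization by the supremum
    cred = np.array([0.5 * (u_star[q] + 1.0 - np.delete(u_star, q).max())
                     for q in range(len(u))])  # credibility measure
    return cred
```

One can verify that the winning cluster always receives a credibility of at least 0.5, and that its credibility plus the largest credibility among the remaining clusters equals one, exactly as required by (13).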
In batch form, the algorithm of credibilistic fuzzy clustering in the accepted notation can be written as

\begin{cases}
u_q(k) = \left( 1 + d^2(x(k), c_q) \right)^{-1}, \\
u_q^{*}(k) = u_q(k) \left( \sup_l u_l(k) \right)^{-1}, \\
Cred_q(k) = \frac{1}{2} \left( u_q^{*}(k) + 1 - \sup_{l \neq q} u_l^{*}(k) \right), \\
c_q = \dfrac{\sum_{k=1}^{N} Cred_q^{\beta}(k) \, x(k)}{\sum_{k=1}^{N} Cred_q^{\beta}(k)},
\end{cases}    (18)

and in online mode, considering (16), (17):

\begin{cases}
\sigma_q^2(k+1) = \left( \sum_{\substack{l=1 \\ l \neq q}}^{m} d^2(x(k+1), c_l(k)) \right)^{-1}, \\
u_q(k+1) = \left( 1 + \dfrac{d^2(x(k+1), c_q(k))}{\sigma_q^2(k+1)} \right)^{-1}, \\
u_q^{*}(k+1) = \dfrac{u_q(k+1)}{\sup_l u_l(k+1)}, \\
Cred_q(k+1) = \frac{1}{2} \left( u_q^{*}(k+1) + 1 - \sup_{l \neq q} u_l^{*}(k+1) \right), \\
c_q(k+1) = c_q(k) + \eta(k+1) \, Cred_q^{\beta}(k+1) \left( x(k+1) - c_q(k) \right).
\end{cases}    (19)

From the point of view of computational implementation, algorithm (19) is no more complicated than procedure (11) and is, in the general case, its generalization to the credibilistic approach to fuzzy clustering.

4. Experimental research

Conducting experimental studies and a comparative analysis of the quality of data clustering using various metrics allows one to objectively assess the effectiveness of the developed method against its analogues. To estimate the quality of the method, we used the following cluster-partition quality criteria [3, 6]:
− Partition Coefficient (PC);
− Classification Entropy (CE);
− Partition Index (SC);
− Separation Index (S);
− Xie and Beni's Index (XB);
− Dunn's Index (DI).

Partition Coefficient (PC): measures the amount of "overlapping" between clusters. Classification Entropy (CE): measures the fuzziness of the cluster partition only, similarly to the Partition Coefficient. Partition Index (SC): the ratio of the sum of compactness and separation of the clusters; it is a sum of individual cluster validity measures normalized through division by the fuzzy cardinality of each cluster. SC is useful when comparing different partitions having an equal number of clusters.
A lower value of SC indicates a better partition. Separation Index (S): in contrast to the Partition Index (SC), the Separation Index uses a minimum-distance separation for partition validity. Xie and Beni's Index (XB): aims to quantify the ratio of the total variation within clusters to the separation of clusters; the optimal number of clusters should minimize the value of the index. Dunn's Index (DI): this index was originally proposed for the identification of "compact and well separated clusters", so the result of the clustering has to be recalculated as if it were a hard partition.

The specific information on the data sets is shown in Table 1.

Table 1
Information on the data sets

Data set | Observations | Clusters
Parkinson's telemonitoring | 5875 | 21
Superconductor temperature prediction | 10000 | 81

Table 2
Results of experiments for the Parkinson's telemonitoring data set

Methods | PC | CE | SC | S | XB | DI | PC
Online Stack Fuzzy Credibilistic Clustering for Data Stream Mining | -2.3308e-04 | 9.1249e-07 | 0.582e-13 | 24.4028 | 48.7067 | 0.2005 | 0.3733
SOM based on possibilistic fuzzy clustering | 0.7473e-04 | 0.21415 | 0.00715 | 1.92845 | 0.01375 | 0.3808 | 0.3954
SOM based on probabilistic fuzzy clustering | 0.05725 | 0.27535 | 0.2395 | 0.4731 | 0.0032 | 1.7309 | 0.1699

Table 3
Results of experiments for the Superconductor data set

Methods | CE | SC | S | XB | DI | PC | CE
Online Stack Fuzzy Credibilistic Clustering for Data Stream Mining | 4.56245e-07 | 271 150 000 | 135 900 000 | -0.000228 | -0.00022 | 369360 | 0.0109
SOM based on possibilistic fuzzy clustering | 0.0000034 | 0.000366 | 0.00585 | 0.38085 | 0.1903 | 2.8555 | 0.1903
SOM based on probabilistic fuzzy clustering | 0.31965 | 4.29665 | 0.02415 | 0.05075 | 0.31965 | 0.5375 | 0.4731

Table 4
A comparison of accuracy

Data | Methods | Highest accuracy | Mean accuracy
Parkinson's telemonitoring | Online Stack Fuzzy Credibilistic Clustering for Data Stream Mining | 61.68 | 60.98
Parkinson's telemonitoring | SOM based on possibilistic fuzzy clustering | 61.75 | 61.30
Parkinson's telemonitoring | SOM based on probabilistic fuzzy clustering | 62.68 | 62.98
Superconductor temperature prediction | Online Stack Fuzzy Credibilistic Clustering for Data Stream Mining | 63.27 | 58.45
Superconductor temperature prediction | SOM based on possibilistic fuzzy clustering | 64.68 | 55.55
Superconductor temperature prediction | SOM based on probabilistic fuzzy clustering | 65.45 | 57.33

Table 5
Comparative characteristics of the average error with different numbers of observations (%)

Algorithm | 50 t | 100 t | 150 t
FCM | 1.62 / 1.19 | 1.35 / 2.55 | 0.98 / 3.03
SOM based on probabilistic fuzzy clustering | 1.66 / 1.62 | 1.32 / 2.72 | 0.99 / 3.12
SOM based on possibilistic fuzzy clustering | 1.22 / 1.15 | 1.02 / 2.02 | 0.75 / 2.10
Online Stack Fuzzy Credibilistic Clustering for Data Stream Mining | 0.68 / 1.00 | 0.45 / 1.25 | 0.12 / 1.33

5. Discussions

Upon analyzing the results obtained, it can be inferred that, irrespective of the volume of the initial data, processing with the proposed method exhibits comparable speed and clustering quality when contrasted with established clustering algorithms and methodologies. The obtained results confirm that the performance of the stack neuro-fuzzy system is better than that of the other network structures, and that it can be a viable structure for Data Stream Mining. The accuracy results presented in Table 4 confirm that the proposed online stack fuzzy credibilistic clustering method for Data Stream Mining retains its performance regardless of the number of observations fed into the clustering process. Based on the experimental findings, it is advisable to endorse the proposed system for practical application in addressing the challenges associated with automatic clustering of large datasets.

6. Conclusion

The problem of fuzzy clustering of data streams by a stack neuro-fuzzy system is considered. The paper proposes a stack neuro-fuzzy system for Data Stream Mining based on the credibilistic approach, designed to work both in batch mode and in a recurrent online version. The network shows that stack structures based on fuzzy models can be applicable in data clustering.
The proposed stack neuro-fuzzy system is quite simple in numerical implementation and can use the well-known online fuzzy clustering methods intended for solving Data Stream Mining problems. Future research could explore the potential of employing stack neuro-fuzzy systems for fuzzy clustering of data streams, aiming to address the complexities inherent in automatic clustering of big data.

Acknowledgements
The work is supported by the state budget scientific research project of Kharkiv National University of Radio Electronics "Deep hybrid systems of computational intelligence for data stream mining and their fast learning" (state registration number 0119U001403).

References
[1] GAs, Genetic Algorithms. "On Genetic-Fuzzy Data-Mining Techniques." (2023).
[2] Liu, Hongfu, et al. "Infinite ensemble for image clustering." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[3] Xu, Jindong, et al. "A fuzzy C-means clustering algorithm based on spatial context model for image segmentation." International Journal of Fuzzy Systems 23 (2021): 816-832.
[4] Gong, Maoguo, et al. "Fuzzy clustering with a modified MRF energy function for change detection in synthetic aperture radar images." IEEE Transactions on Fuzzy Systems 22.1 (2013): 98-109.
[5] Rani, V., and S. Kumar. "Dissimilarity measure between intuitionistic fuzzy sets and its applications in pattern recognition and clustering analysis." Journal of Applied Mathematics, Statistics and Informatics 19.1 (2023): 61-77.
[6] Bezdek, James C., Robert Ehrlich, and William Full. "FCM: The fuzzy c-means clustering algorithm." Computers & Geosciences 10.2-3 (1984): 191-203.
[7] Cureton, Edward E., and Ralph B. D'Agostino. Factor Analysis: An Applied Approach. Psychology Press, 2013.
[8] Sanger, Terence D. "Optimal unsupervised learning in a single-layer linear feedforward neural network." Neural Networks 2.6 (1989): 459-473.
[9] Specht, Donald F.
"A general regression neural network." IEEE Transactions on Neural Networks 2.6 (1991): 568-576.
[10] Höppner, Frank, et al. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. John Wiley & Sons, 1999.
[11] Zhou, Jian, et al. "Credibilistic clustering: the model and algorithms." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 23.04 (2015): 545-564.
[12] Zhang, Yong, et al. "Possibilistic c-means clustering based on the nearest-neighbour isolation similarity." Journal of Intelligent & Fuzzy Systems 44.2 (2023): 1781-1792.
[13] Hu, Zhengbing, et al. "Fuzzy clustering of incomplete data by means of similarity measures." 2019 IEEE 2nd Ukraine Conference on Electrical and Computer Engineering (UKRCON). IEEE, 2019. doi: 10.1109/UKRCON.2019.8879844.
[14] Shafronenko, Alina, et al. "Online credibilistic fuzzy clustering of data using membership functions of special type." CMIS. 2020.
[15] Bodyanskiy, Yevgeniy, Alina Shafronenko, and Sergii Mashtalir. "Online robust fuzzy clustering of data with omissions using similarity measure of special type." Lecture Notes in Computational Intelligence and Decision Making: Proceedings of the XV International Scientific Conference "Intellectual Systems of Decision Making and Problems of Computational Intelligence" (ISDMCI'2019), Ukraine, May 21-25, 2019. Springer International Publishing, 2020.
[16] Bodyanskiy, Ye. V., A. Yu. Shafronenko, and I. N. Klymova. "Online fuzzy clustering of incomplete data using credibilistic approach and similarity measure of special type." Radio Electronics, Computer Science, Control 1 (2021): 97-104.