<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Robust Clustering of Data Streams using Incremental Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Basheer Hawwash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olfa Nasraoui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Discovery and Web Mining Lab Computer Engineering and Computer Science Department University of Louisville</institution>
          ,
          <addr-line>Louisville, KY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <volume>5</volume>
      <issue>20</issue>
      <abstract>
        <p>In many applications, it is useful to detect the evolving patterns in a data stream, and be able to capture them accurately (e.g. detecting the purchasing trends of customers over time on an e-commerce website). Data stream mining is challenging because of harsh constraints due to the continuous arrival of huge amounts of data that prevent unlimited storage and processing in memory, and the lack of control over the data arrival pattern. In this paper, we present a new approach to discover the evolving dense clusters in a dynamic data stream by incrementally updating the cluster parameters using a method based on robust statistics. Our approach exhibits robustness toward an unknown number of outliers, with no assumptions about the number of clusters. Moreover, it can adapt to the evolution of the clusters in the input data stream.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data Stream Mining has attracted increasing attention in recent years [
        <xref ref-type="bibr" rid="ref2 ref3 ref8">2,3,8</xref>
        ],
due to the proliferation of applications that generate massive streams of data.
Data streams are characterized by a huge throughput, which makes it hard, if
not impossible, to store and process every single data point; therefore each
new data point can be processed only once. Moreover, there is no control over, or
expectation of, the order of data arrival, which adds more difficulties when
trying to extract useful information from data streams. An example of a
data stream is the sequence of transactions performed on a busy e-commerce website; the
output of data stream mining, in this case, would be the purchasing patterns
of the customers.
      </p>
      <p>
        To meet the challenge of mining data streams, clustering algorithms should
be scalable enough to cope with huge throughputs of data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Several clustering
algorithms have been designed to deal with large amounts of data by updating
the clustering model incrementally or by processing data points in small batches.
For example, STREAMS [
        <xref ref-type="bibr" rid="ref2">2</xref>
] first processes the data stream in batches, which
generates a set of weighted cluster centers for each batch; then the sets of cluster
centers from all batches are themselves clustered. However, it is not suited to handling or
detecting outliers, because it is based on minimizing the sum of squared distances,
which is very sensitive to extreme outliers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. BIRCH [
        <xref ref-type="bibr" rid="ref3">3</xref>
] is another incremental
clustering algorithm, which uses a set of statistical measurements to capture the
closeness of data points. It updates these measurements incrementally and stores
the clusters in a tree of Cluster Features (CF). One of BIRCH’s limitations is
that it treats all data points in the same way, without considering the time
at which they arrived. Hence, it cannot handle the evolution of data,
where a cluster can disappear, change, or merge with other clusters, which is one
of the inherent characteristics of a data stream. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
], an algorithm for clustering
data streams, known as TRAC-STREAMS, was presented. It relies on incremental
processing and robust statistics to learn an evolving model of the clusters in
a data stream. Its objective function consists of two terms: the first term
minimizes the scaled distances of the good points to the optimal cluster locations,
to achieve robustness, while the second term maximizes a soft estimate of the
cluster cardinality (the sum of weights), which ensures that as many good points as
possible are used in the estimation process. However, a constant factor is used
to balance the two terms; it has to be chosen carefully and depends on the
data’s dimensionality. DBSCAN [
        <xref ref-type="bibr" rid="ref9">9</xref>
] is a density-based clustering technique that
relies on labeling each data point as a core point, border point, or noise point,
based on the number of points within a predefined distance. This labeling relies
on two important parameters: MinPts (a threshold on the density around a core
point) and Eps (a measure of scale used as a threshold on the size of a dense
region). Its incremental version [
        <xref ref-type="bibr" rid="ref10">10</xref>
] (Inc-DBSCAN) achieves the same goal in an
online manner, by attempting to assign each new data point one of the above
labels and updating the clusters accordingly (merging, elimination, etc.). It was
proved that Incremental DBSCAN results in clusters that are equivalent to those of the
original DBSCAN; however, it has the same quadratic complexity in terms of the
data size.
      </p>
      <p>
        In this paper, we present a novel algorithm for extracting evolving clusters
from a massive data stream in one pass, while being able to resist and detect the
presence of outliers in the data stream. Our approach is rooted in robust statistics
[
        <xref ref-type="bibr" rid="ref7">7</xref>
] since it uses a robust estimation of the centroids (location) as well as a robust and
dynamic estimation of the cluster scales. It is also based on a robust,
distribution-independent statistical test (Chebyshev) for the detection of outliers and for the
merging of compatible clusters, thus assuring the robustness and compactness of the
extracted cluster model. The statistically based robustness to noise, the dynamic
estimation of scales, and the fast analytical optimization constitute the distinguishing
contributions of our approach compared to existing stream clustering methods.
The main distinction between RINO-STREAMS and TRAC-STREAMS is
the simpler objective function, which consists of only one term (the density) for
the former, thus eliminating the need to balance two terms in the latter. This
results in a simpler and faster optimization. Moreover, RINO-STREAMS has a
better cluster merging technique.
      </p>
      <p>
The rest of the paper is organized as follows. In Section 2, we describe the
proposed approach (RINO-STREAMS), while in Section 3, we present the
experimental results that validate our algorithm. Finally, we conclude in Section 4.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>The Proposed Approach: RINO-STREAMS</title>
      <p>
        The proposed algorithm, RINO-STREAMS, is based on incrementally updating
the clustering model using each newly arriving data point, thus maintaining
the model summaries (clusters) over time. The data stream X consists of a set
of data instances or points, indexed by the order of their arrival and
presented as $x_1, x_2, \ldots, x_N$, where N is the size of the data stream. Each
cluster i at time n (i.e. after receiving n points) is defined by its centroid $c_{i,n}$,
its scale $\sigma^2_{i,n}$ and its age $t_i$. The centroid represents the location of the cluster
center with respect to the other clusters at any time, while the scale represents the
influence zone around the centroid. The data points are weighted using a weight
function that decreases with the distance from the cluster prototype/centroid as
well as with the time at which the data arrived. Hence, newer data points have
more effect on the model than older ones, thus capturing the evolution of
clusters over time. Using an exponentially decaying forgetting factor was shown
to be successful in modeling evolution in previous work [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4,5,6</xref>
]. The
age is the difference between the current time/iteration or step (n) and the time
when the cluster was first created.
      </p>
      <p>Definition 1. Adaptive Robust Weight: For the ith cluster $C_i$, $i = 1, \ldots, K$,
the robust weight of the jth data point at time n (i.e. when the points $x_1, x_2, \ldots, x_n$
have been encountered so far) is defined as:
$$ w_{ij,n} = e^{-\left( \frac{d^2_{ij,n}}{2\sigma^2_{i,n}} + \frac{n-j}{\tau} \right)} \quad (1) $$
where $\tau$ is an optional application-dependent parameter that controls the
time decay rate of old data points (when needed) and how much importance is
given to newer data points; $d^2_{ij}$ is the Euclidean distance from the jth data point
to the ith cluster’s centroid $c_i$; and $\sigma^2_{i,n}$ is the scale of cluster i at time/step n,
which relates to the size of the influence zone of the cluster, where every point outside
this zone is considered an outlier with respect to cluster $C_i$. The second term of the
exponent represents a forgetting factor, whereby the weight of the data point j decreases
geometrically with the value of $n - j$; hence a new data point has a higher
weight, since its forgetting factor is close to zero, while an older point
has a larger value of the forgetting factor, which results in a smaller
weight. Assuming that the parameters of the model do not change significantly
with every new point, each old point’s weight, and thus its influence on the
cluster parameter estimates, decreases as follows:
$$ w_{ij,n} = e^{-\frac{1}{\tau}} \, w_{ij,n-1} \quad (2) $$
After encountering n data points, we search for the cluster centroids $c_{i,n}$ and
scales $\sigma^2_{i,n}$ that optimize the density objective function $\delta_{i,n}$:
$$ \max_{c_{i,n},\, \sigma^2_{i,n}} \left\{ \delta_{i,n} = \frac{\sum_{j=1}^{n} w_{ij,n}}{\sigma^2_{i,n}} \;\middle|\; i = 1, \ldots, K \right\} \quad (3) $$</p>
      <p>
        The robust weight $w_{ij,n}$ can also be considered as a typicality or possibilistic
degree of membership [
        <xref ref-type="bibr" rid="ref13">13</xref>
] of the point j in the cluster i after encountering n
points; hence the sum of the weights for each cluster represents a soft cardinality
of that cluster. A high cardinality is desirable because it means that the cluster
represents enough points to justify its existence, while a small scale means that it is a
good and compact cluster. The density of the cluster, $\delta_{i,n}$, combines these
two metrics, and hence it increases as the cardinality increases and as
the scale decreases. The advantage of optimizing the density, which combines
the two metrics, is that judging the quality of the cluster using only the sum of
weights (the numerator) is not enough, because a cluster with a large number of
similar points that are not confined within a small zone is not desirable from
the point of view of density.
      </p>
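      <p>For concreteness, the following minimal Python sketch (an illustration of equations (1) and (3) only, not the authors' implementation; all function and variable names are our own assumptions) computes the robust weight of a point with respect to one cluster and the resulting cluster density:</p>
      <preformat><![CDATA[
import numpy as np

def robust_weight(x, centroid, scale, n, j, tau):
    """Robust weight w_{ij,n} of point x_j at time n (equation (1)).

    scale is the cluster's sigma^2; tau is the optional forgetting
    lifetime (pass np.inf to disable forgetting)."""
    d2 = np.sum((np.asarray(x) - np.asarray(centroid)) ** 2)  # squared Euclidean distance
    forgetting = 0.0 if np.isinf(tau) else (n - j) / tau      # second term of the exponent
    return np.exp(-(d2 / (2.0 * scale) + forgetting))

def density(sum_of_weights, scale):
    """Cluster density delta_{i,n} = W_{i,n} / sigma^2_{i,n} (the quantity maximized in (3))."""
    return sum_of_weights / scale
]]></preformat>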
      <p>
        Optimizing the density objective function is done using alternating
optimization: the optimal centroid is found by fixing all the other variables, and
the same is then done when optimizing the scale. This optimization technique is also
used in the EM algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. Below, we present the optimal update theorems
for the cluster centroids and scales, while omitting the proofs due to the paucity
of space. The proofs are similar to the approach followed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Theorem 1. Optimal Incremental Centroid Update: Given the previous
centroids $c_{i,n-1}$, and assuming that the scales do not change much relative to
the scale that resulted from the previous point, the new centroid that optimizes
(3) after the nth point is given by:
$$ c_{i,n} = \frac{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1}\, x_j + w_{in,n}\, x_n}{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1} + w_{in,n}} \quad (4) $$</p>
      <p>The first term in the numerator (and denominator) represents the
previous knowledge about the location of the centroid, obtained from the points
($x_1, \ldots, x_{n-1}$); this term is multiplied by the forgetting factor to reduce
its effect on the new updated centroid and give more emphasis to the new data
point. This follows directly from the fact that the weight of each point is reduced
by $e^{-\frac{1}{\tau}}$, as given by equation (2). The second term represents the weighted effect
of the new data point on the location of the cluster.</p>
      <p>Theorem 2. Optimal Incremental Scale Update: Given the previous scales
$\sigma^2_{i,n-1}$, the new scale that optimizes (3) after the arrival of the nth point is given by:
$$ \sigma^2_{i,n} = \frac{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1}\, d^2_{ij} + w_{in,n}\, d^2_{in}}{2 \left( e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1} + w_{in,n} \right)} \quad (5) $$</p>
      <p>Similar to the centroid update equation, the first term in the numerator
represents the sum of the contributions of all the previous points ($x_1, \ldots, x_{n-1}$), and
this value is also penalized by the forgetting factor. The second term represents
the new information obtained from the new data point.</p>
      <p>Following these two update equations, the algorithm updates the cluster
parameters incrementally upon the arrival of each new data point, and it keeps
as a summary of the data stream only the previous centroids ($c_{i,n}$), scales ($\sigma^2_{i,n}$)
and sums of weights ($W_{i,n} = \sum_{j=1}^{n} w_{ij,n}$) for each cluster.</p>
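      <p>The summary-only bookkeeping implied by Theorems 1 and 2 can be sketched in Python as follows (a sketch under the stated update equations; the class layout, the numerical floor on the scale, and all names are our own assumptions, not the paper's code):</p>
      <preformat><![CDATA[
import numpy as np

class ClusterSummary:
    """Per-cluster summary: only weighted running sums are kept across the stream."""

    def __init__(self, x, initial_scale):
        self.centroid = np.asarray(x, dtype=float)
        self.scale = initial_scale           # sigma^2, starts at sigma_0^2
        self.sum_wx = self.centroid.copy()   # running sum of w_ij * x_j
        self.sum_w = 1.0                     # W_i: running sum of weights
        self.sum_wd2 = 0.0                   # running sum of w_ij * d_ij^2
        self.age = 0

    def update(self, x, w, tau):
        """Incremental updates of equations (2), (4) and (5)."""
        x = np.asarray(x, dtype=float)
        decay = np.exp(-1.0 / tau)                 # forgetting factor of (2)
        d2 = np.sum((x - self.centroid) ** 2)
        self.sum_wx = decay * self.sum_wx + w * x      # numerator of (4)
        self.sum_w = decay * self.sum_w + w            # denominator of (4)
        self.sum_wd2 = decay * self.sum_wd2 + w * d2   # numerator of (5)
        self.centroid = self.sum_wx / self.sum_w                    # equation (4)
        self.scale = max(self.sum_wd2 / (2.0 * self.sum_w), 1e-9)   # equation (5), floored
        self.age += 1

    @property
    def density(self):
        return self.sum_w / self.scale   # delta_i = W_i / sigma_i^2
]]></preformat>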
      <sec id="sec-1b-1">
        <title>Detecting Outliers</title>
        <p>An outlier is defined as a data point that does not belong to any of the existing
clusters (i.e. it lies outside their influence zones). If a point is determined to be an outlier
with respect to all the existing clusters, then a new cluster is created with the
point itself as its centroid. This newly created cluster is allowed a grace
period, $t_{mature}$; if after this threshold it is still weak (i.e. it has a density less
than a threshold $\delta_{min}$), then it is considered an outlying cluster and is
deleted. To detect outliers we use the Chebyshev bound, an upper
tail bound on the total probability that some random variable falls in
the tail of the distribution, i.e. far from the mean. One important property of
Chebyshev bounds is that they make no assumptions about the distribution
of the data; rather, they assume that a reliable scale estimate is available, which
is the case in RINO-STREAMS by virtue of the robust density and scale
optimization.</p>
        <p>Chebyshev Bounds: The Chebyshev bound for a random variable Y with
mean $\mu$ and standard deviation $\sigma$, for any real number $t &gt; 0$, is given by:
$$ \Pr\left\{ |Y - \mu| \geq t\sigma \right\} \leq \frac{1}{t^2} \quad (6) $$</p>
        <p>This bound allows us to design an outlyingness test for any new data point
with respect to cluster $C_i$, with significance probability $1/t^2$. The Chebyshev
inequality can be rearranged as follows:
$$ \Pr\left\{ e^{-\frac{|Y-\mu|^2}{2\sigma^2}} \leq e^{-\frac{t^2}{2}} \right\} \leq \frac{1}{t^2} \quad \text{or} \quad \Pr\left\{ w_{ij} \leq e^{-\frac{t^2}{2}} \right\} \leq \frac{1}{t^2} $$
which means that if the robust weight $w_{ij}$ of point j with respect to cluster
$C_i$ is less than the constant value $e^{-\frac{t^2}{2}}$, then point j is considered to be an
outlier with a significance probability of $\frac{1}{t^2}$. Moreover, the Chebyshev bound is
also used to compute the cardinality of a cluster during the evaluation (i.e.
all the points that pass the Chebyshev test are considered part of the cluster, with a
significance probability of $1 - \frac{1}{t^2}$).</p>
        <p>Definition 2. Outlier with Chebyshev probability of $\frac{1}{t^2}$: $x_j$ is an outlier
with respect to cluster $C_i$ at time n, with a significance probability of $\frac{1}{t^2}$, if:
$$ w_{ij,n} &lt; e^{-\frac{t^2}{2}} \quad (7) $$
We further use the Chebyshev bound to design a compatibility test for merging
clusters $C_i$ and $C_k$. This is done by checking their mutual Chebyshev bounds
with significance probability $1/t^2$: given the distance $d_{ik}$ between the centroids
$c_i$ and $c_k$, and using (6), the clusters are merged if:
$$ d^2_{ik} \leq t^2\, \sigma^2_{i,n} \quad \text{and} \quad d^2_{ik} \leq t^2\, \sigma^2_{k,n} \quad (9) $$
and the age of the merged cluster is $t_{new} = \max(t_i, t_k)$. Moreover, a test is performed to
eliminate any bad cluster whose density is less than a threshold ($\delta_{min}$) once it is mature
enough (i.e. $t_i &gt; t_{mature}$). Splitting clusters in RINO-STREAMS occurs naturally and does
not require any special treatment (see the experiments in Section 3.3); a cluster split
occurs when points from a cluster evolve in two or more different directions.</p>
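        <p>The outlier test of Definition 2 and the mutual compatibility test reduce to simple threshold checks. A Python sketch follows (our own illustration, with the merge test written out under the mutual-Chebyshev reading above; the function names are assumptions):</p>
        <preformat><![CDATA[
import numpy as np

def is_outlier(w_ij, t):
    """Definition 2: x_j is an outlier w.r.t. cluster i if w_ij < e^{-t^2/2}."""
    return w_ij < np.exp(-(t ** 2) / 2.0)

def chebyshev_compatible(c_i, c_k, scale_i, scale_k, t):
    """Mutual Chebyshev compatibility test for merging clusters i and k:
    each centroid must lie within t standard deviations of the other."""
    d2 = np.sum((np.asarray(c_i) - np.asarray(c_k)) ** 2)
    return d2 <= (t ** 2) * scale_i and d2 <= (t ** 2) * scale_k
]]></preformat>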
        <p>The complete steps of RINO-STREAMS are listed in Algorithm 1. The input
parameters of RINO-STREAMS include: the maximum number of clusters $K_{max}$,
an upper bound on the allowed number of clusters, which constrains the
maximal amount of memory devoted to storing the cluster model; the initial
scale $\sigma^2_0$, which is assigned to each newly created cluster; the density threshold
$\delta_{min}$, which is used to ensure that only good clusters with a high density are kept;
the maturity age $t_{mature}$, which provides newly created clusters with a grace
period before their density quality is tested; the forgetting factor $\tau$, which controls
the decay of the data point weights over time; and the Chebyshev bound constant
$t$, which is used in equations (7) and (9).</p>
        <p>For each new data point, RINO-STREAMS computes the distance and the
weight with respect to all the existing clusters, which is done in a linear number of steps.
Since the clustering model is updated incrementally, nothing is recomputed
from scratch, and hence the computational complexity of RINO-STREAMS is
$O(N K^2)$, where N is the size of the data stream and K is the number of clusters
discovered in the data set ($0 \leq K \leq K_{max}$), a very small value compared
to N. Moreover, the memory requirements of RINO-STREAMS are linear in
the number of clusters, because at any point in time only the cluster properties
($c_i, \sigma^2_i, W_i$) are kept, besides the new data point. This memory is constrained
by $K_{max} \cdot B_f \cdot (Dim + 3)$, where $B_f$ is the number of bytes needed to store
one floating point number, and $Dim$ is the dimensionality (the number of data
attributes), corresponding to the size of the centroid $c_i$; the number 3 refers to the
three remaining scalar values kept per cluster (the scale, the age, and the sum of weights).</p>
      </sec>
      <sec id="sec-1-1">
        <title>Algorithm 1 RINO-STREAMS</title>
        <p>Input: maximum number of clusters ($K_{max}$), initial scale ($\sigma^2_0$), density threshold ($\delta_{min}$),
maturity age ($t_{mature}$), forgetting factor ($\tau$), Chebyshev bound constant ($t$)</p>
        <p>Output: the cluster model after n points, $C_1 \cup C_2 \cup \ldots \cup C_K$, where
$C_i = (c_{i,n}, \sigma^2_{i,n}, t_i, W_{i,n})$ and $W_{i,n} = \sum_{j=1}^{n} w_{ij,n}$</p>
        <preformat>
FOR n = 1 TO N DO   // single pass over the data stream
    Compute the distances d_{in} and robust weights w_{in,n} between x_n
    and all clusters C_i, i = 1, ..., K
    IF K &lt; K_max AND x_n is an outlier with respect to all existing clusters
        // Create a new cluster centered on x_n
        K = K + 1
        c_K = x_n                   // centroid
        sigma^2_K = sigma^2_0       // initial scale
        t_K = 0                     // initial age
        W_K = 1                     // initial sum of robust weights
        delta_K = 1 / sigma^2_0     // initial density
    END IF
    FOR each cluster C_i, i = 1, ..., K
        // Update the compatible clusters if x_n is not an outlier
        IF x_n is NOT an outlier with respect to cluster i (see Def. 2)
            Update c_{i,n} using (4)
            Update sigma^2_{i,n} using (5)
            Update the sum of weights: W_{i,n} = e^{-1/tau} W_{i,n-1} + w_{in,n}
            Update the density: delta_{i,n} = W_{i,n} / sigma^2_{i,n}
            Update the age: t_i = t_i + 1
        END IF
    END FOR
    FOR each pair of clusters C_i and C_k, i, k = 1, ..., K
        IF C_i and C_k are Chebyshev-compatible using equation (9)
            Merge clusters C_i and C_k using equations (10) and (11)
        END IF
    END FOR
    FOR each cluster C_i, i = 1, ..., K
        IF t_i &gt; t_mature AND delta_i &lt; delta_min   // remove mature clusters with low density
            Remove C_i from the cluster model
        END IF
    END FOR
END FOR
        </preformat>
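        <p>Tying the pieces together, a compact one-pass Python sketch of Algorithm 1 is given below. It reuses robust_weight, is_outlier, chebyshev_compatible and ClusterSummary from the sketches above; the merge rule, which simply adds the two clusters' running sums, is our own simplification, since equations (10) and (11) are not reproduced above:</p>
        <preformat><![CDATA[
def rino_streams_sketch(stream, k_max, initial_scale, delta_min, t_mature, tau, t):
    """One pass over the stream, maintaining only per-cluster summaries."""
    clusters = []
    for n, x in enumerate(stream, start=1):
        weights = [robust_weight(x, c.centroid, c.scale, n, n, tau) for c in clusters]
        # Create a new cluster if x is an outlier w.r.t. every existing cluster
        if len(clusters) < k_max and all(is_outlier(w, t) for w in weights):
            clusters.append(ClusterSummary(x, initial_scale))
        # Update every existing cluster for which x is not an outlier
        for c, w in zip(clusters, weights):
            if not is_outlier(w, t):
                c.update(x, w, tau)
        # Merge Chebyshev-compatible pairs (simplified merge: add the summaries)
        kept = []
        for c in clusters:
            for m in kept:
                if chebyshev_compatible(c.centroid, m.centroid, c.scale, m.scale, t):
                    m.sum_wx += c.sum_wx
                    m.sum_w += c.sum_w
                    m.sum_wd2 += c.sum_wd2
                    m.centroid = m.sum_wx / m.sum_w
                    m.scale = max(m.sum_wd2 / (2.0 * m.sum_w), 1e-9)
                    m.age = max(m.age, c.age)
                    break
            else:
                kept.append(c)
        # Remove mature clusters whose density fell below the threshold
        clusters = [c for c in kept if not (c.age > t_mature and c.density < delta_min)]
    return clusters
]]></preformat>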
      </sec>
      <sec id="sec-1-2">
        <title>Dim is the dimensionality (number of data at</title>
        <p>tributes) and refers to the size of the centroid ci and the number 3 refers to the
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental Results</title>
      <p>RINO-STREAMS was evaluated using three synthetic datasets generated using
a random Gaussian generator: DS5 has five clusters, DS8 has eight clusters, and
DS16 has 16 clusters. For each dataset, nine stream variations were created by
adding different percentages of noise and by changing the order of data arrival.
We use a brief code to describe the experiments performed with these
dataset variations. The first part of the code is the name of the algorithm used;
the second part, Cx, gives the number of true clusters x; it is followed by the
order of arrival of the points (O: ordered one cluster at a time, followed by the
noise, if any; R2: random points from two clusters at a time, followed by noise;
R: completely random from all clusters and noise). The final part of the code
describes the percentage of noise added as Ny,
where y is the percentage of noise relative to all the data.</p>
      <sec id="sec-2-1">
        <title>Experimental Setup</title>
        <p>
          We compared RINO-STREAMS against two density-based online clustering
algorithms: TRAC-STREAMS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and Inc-DBSCAN [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. Since both
RINO-STREAMS and TRAC-STREAMS have similar parameters, we compared the
results of these two algorithms by changing the values of three parameters
while fixing the others. These parameters and their values are listed
in Table 1. The first parameter is the optional forgetting lifetime ($\tau$), to which we
assigned values given as a percentage of the data stream length ($|X|$), with infinity for
the case of no forgetting; the second parameter is the minimum sum of weights
($W_{min}$), which affects the minimum density threshold value ($\delta_{min} = W_{min} / \sigma^2_0$); and
finally the maximum number of clusters allowed ($K_{max}$), given as a percentage of the
real number of clusters in the ground truth ($K_G$), where $K_G = 5, 8,$ or $16$, depending
on the data set. Some non-critical parameters were fixed: the initial
scale value $\sigma^2_0 = 0.01$, the maturity age $t_{mature} = 100$, and the Chebyshev constant
$\frac{1}{t^2} = 0.075$. When comparing against Inc-DBSCAN, we picked the best results
for both methods and compared them using the different evaluation metrics discussed
below, because the two algorithms do not share similar parameters. To optimize the
performance of Inc-DBSCAN, we tried different values of MinPts from 1 to
10 and chose the best value; we then plotted the sorted k-dist graph for each
run and chose the best Eps value, as recommended in [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. Table 2 shows Inc-DBSCAN's
optimal values of MinPts and Eps, found manually for the different
datasets. We chose Inc-DBSCAN because it is a leading density-based algorithm
that has been widely studied in the literature.
        </p>
        <p>
          To evaluate the results, we used four evaluation metrics. Two are internal
metrics: the Silhouette index [
          <xref ref-type="bibr" rid="ref12">12</xref>
] (a value close to 1 means that the data points were well
clustered) and the similarity matrix [
          <xref ref-type="bibr" rid="ref12">12</xref>
] containing the similarity values between
every two data points, after ordering the data by their cluster labels. The rest of
the metrics are external validation metrics: the error in the number of clusters detected,
and the average centroid error relative to the ground truth. A cluster is
considered to be correctly discovered if its centroid is close enough to one of the found
centroids, based on the Chebyshev test (7), and if the difference between the two
scales is less than a threshold value. In the case of Inc-DBSCAN, the centroid
and scale were calculated from the points themselves; because Inc-DBSCAN
does not have the notion of centroid and scale, we cannot compare these values
with the ground truth. Table 3 explains the x-axis index values for Figure 3. All
the experiments against TRAC-STREAMS showed that RINO-STREAMS
performed as well as or better than TRAC-STREAMS; hence, due to the paucity of
space, we only present the comparison against Inc-DBSCAN.</p>
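        <p>For reference, the internal validation can be reproduced with standard tooling; the snippet below (a toy illustration with synthetic blobs, not our experimental pipeline) computes the Silhouette index with scikit-learn:</p>
        <preformat><![CDATA[
import numpy as np
from sklearn.metrics import silhouette_score

# Two Gaussian blobs stand in for discovered clusters; in our setting the
# labels would come from assigning each point to the closest cluster that
# passes the Chebyshev test, with noise points excluded.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(1.0, 0.1, (100, 2))])
labels = np.repeat([0, 1], 100)
print(silhouette_score(X, labels))  # close to 1 => well-separated, compact clusters
]]></preformat>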
        <p>In this section, we compare the clustering quality of the proposed algorithm
with that of Inc-DBSCAN. The optimal Inc-DBSCAN parameter values are
listed in Table 2. For the proposed algorithm, we set the reasonable values
$K_{max} = 200\%\, K_G$, $\tau = 40\%\, |X|$ and $W_{min} = 20$. Figure 1 shows the clustering
output for the experiment C16RN18: figure 1(a) shows the ground truth, while
figures 1(b) and 1(c) show the clustering output using Inc-DBSCAN and
RINO-STREAMS, respectively. Inc-DBSCAN falsely detected more clusters, because it
depends on the notion of density only, so outliers that are close to each other are
considered valid clusters, whereas RINO-STREAMS uses a robust estimation of
scale that makes it more resistant to outliers, and it also uses quality tests
(using $\delta_{min}$) to ensure that only high-quality clusters are maintained. Figure 2 shows
the similarity matrices of the clustering outputs for Inc-DBSCAN and
RINO-STREAMS, which serve to validate both cluster outputs; however, Inc-DBSCAN
tends to fragment some clusters (e.g. the blocks toward the bottom right corner).
Figure 3 shows the results for DS16, where the x-axis corresponds to the different
dataset variations (i.e. different noise percentages and orders of data arrival),
as explained in Table 3. DBSCAN always overestimates the actual number of
clusters, and it labels small groups of noise as clusters, thus finding more false
clusters as the noise increases, whereas RINO-STREAMS correctly detected the
right number of clusters in most cases. The Silhouette index for RINO-STREAMS
is better than that of Inc-DBSCAN in all the configurations, which
means a higher-quality clustering.
        </p>
        <p>Fig. 1: Clustering output for Experiment C16RN18: (a) Ground Truth, (b) Inc-DBSCAN, (c) RINO-STREAMS (noise points detected and removed). Note 1: contours denote the Chebyshev bound resulting from the estimated scales; only the points that were not labeled as noise are shown.</p>
        <p>Fig. 2: DS16: Similarity Matrix for Experiment C16RN18.</p>
        <p>Fig. 3: DS16: RINO vs. DBSCAN (see Table 3); the absence of a bar means no error.</p>
        <p>
To illustrate how clusters merge and split in RINO-STREAMS, we designed
two experiments: one where a single cluster evolves into three different clusters, to show
cluster splitting, and one where three clusters evolve into one cluster, to show
cluster merging. Figure 4 shows the evolution of the cluster output at five different time
periods, where time is measured in terms of the number of data points that have
arrived relative to the data stream size ($|X|$). It can be seen that one cluster
(cluster number 1) is detected at the beginning, and then, as the cluster splits,
two more clusters are detected. Figure 5 illustrates the gradual merging of three
different clusters, over five different time periods, into one cluster.</p>
        <p>Note: in Figures 4 and 5, points in turquoise are old points (their time of arrival exceeds $\tau$).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We presented RINO-STREAMS, a novel algorithm for mining evolving clusters
from a dynamic data stream in one pass, while being able to resist and detect
the presence of outliers in the data stream. Our approach is rooted in robust
statistics since it uses a robust estimation of centroids (location) as well as a
robust and dynamic estimation of the cluster scales. It is also based on a
robust, distribution-independent statistical test (Chebyshev) for the detection of
outliers and for the merging of compatible clusters, thus ensuring the robustness
and compactness of the extracted evolving cluster model. The statistically based
robustness, dynamic estimation of scales, and fast analytical optimization
distinguish our approach from existing stream clustering methods. Our experiments
validated the robustness properties of the proposed algorithm and the accuracy
of its clustering model obtained in a single pass compared to two competing
density-based clustering algorithms, TRAC-STREAMS and Inc-DBSCAN.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This work was supported by US National Science Foundation Grant IIS-0916489.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Barbara</surname>
          </string-name>
          ,
          <article-title>Requirements for clustering data streams</article-title>
          ,
          <source>ACM SIGKDD Explorations Newsletter</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>2</issue>
          , pp.
<fpage>23</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>O'Callaghan</surname>
          </string-name>
          ,
          <article-title>Clustering data streams</article-title>
          ,
          <source>in IEEE Symposium on Foundations of Computer Science (FOCS'00)</source>
          , Redondo Beach, CA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          ,
          <article-title>BIRCH: An efficient data clustering method for large databases</article-title>
          ,
          <source>in ACM SIGMOD International Conference on Management of Data</source>
          , New York, NY,
          <year>1996</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          , ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Density-based clustering over an evolving data stream with noise</article-title>
          , in
          <source>2006 SIAM Conference on Data Mining</source>
          ,
          <year>2006</year>
          ,
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rojas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>TECNO-STREAMS: Tracking evolving clusters in noisy data streams with a scalable immune system learning model</article-title>
          ,
          <source>in Third IEEE International Conference on Data Mining (ICDM'03)</source>
          , Melbourne, FL,
          <year>November 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Density-based clustering for real-time stream data</article-title>
          , in
          <source>KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , ACM
          ,
          <year>2007</year>
          ,
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Huber</surname>
          </string-name>
          , Robust Statistics, John Wiley &amp; Sons, New York,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nasraoui</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rojas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Robust Clustering for Tracking Noisy Evolving Data Streams</article-title>
          ,
          <source>SDM</source>
          ,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          , in
          <source>Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)</source>
          , AAAI Press,
          <year>1996</year>
          ,
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>Incremental Clustering for Mining in a Data Warehousing Environment</article-title>
          , in
          <source>Proc. 24th Int. Conf. Very Large Data Bases (VLDB)</source>
          ,
          <year>1998</year>
          ,
          <fpage>323</fpage>
          -
          <lpage>333</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Borman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>The expectation maximization algorithm: A short tutorial</article-title>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , Introduction to Data Mining, Addison Wesley,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Keller</surname>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>The possibilistic c-means algorithm: insights and recommendations</article-title>
          .
          <source>IEEE Transactions on Fuzzy Systems</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <fpage>385</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>