<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Robust Clustering of Data Streams using Incremental Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Basheer Hawwash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olfa Nasraoui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Discovery and Web Mining Lab Computer Engineering and Computer Science Department University of Louisville</institution>
          ,
          <addr-line>Louisville, KY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <volume>5</volume>
      <issue>20</issue>
      <abstract>
        <p>In many applications, it is useful to detect the evolving patterns in a data stream, and be able to capture them accurately (e.g. detecting the purchasing trends of customers over time on an e-commerce website). Data stream mining is challenging because of harsh constraints due to the continuous arrival of huge amounts of data that prevent unlimited storage and processing in memory, and the lack of control over the data arrival pattern. In this paper, we present a new approach to discover the evolving dense clusters in a dynamic data stream by incrementally updating the cluster parameters using a method based on robust statistics. Our approach exhibits robustness toward an unknown number of outliers, with no assumptions about the number of clusters. Moreover, it can adapt to the evolution of the clusters in the input data stream.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data Stream Mining has attracted increasing attention in recent years [
        <xref ref-type="bibr" rid="ref2 ref3 ref8">2,3,8</xref>
        ],
due to the proliferation of applications that generate massive streams of data.
Data streams are characterized by a huge throughput, which makes it hard, if
not impossible, to store and process every single data point; therefore each
new data point can be processed only once. Moreover, there is no control over, or
expectation of, the order of data arrival, which adds more difficulties when
trying to extract useful information from data streams. An example of a
data stream is the sequence of transactions performed on a busy e-commerce website; the
output of data stream mining, in this case, would be the purchasing patterns
of the customers.
      </p>
      <p>
        To meet the challenge of mining data streams, clustering algorithms should
be scalable enough to cope with huge throughputs of data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Several clustering
algorithms have been designed to deal with large amounts of data by updating
the clustering model incrementally or by processing data points in small batches.
For example, STREAMS [
        <xref ref-type="bibr" rid="ref2">2</xref>
] first processes the data stream in batches, which
generates a set of weighted cluster centers for each batch; then the sets of cluster
centers from all batches are themselves clustered. However, it is not suited to handling or
detecting outliers, because it is based on minimizing the sum of squared distances,
which is very sensitive to extreme outliers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. BIRCH [
        <xref ref-type="bibr" rid="ref3">3</xref>
] is another incremental
clustering algorithm, which uses a set of statistical measurements to capture the
closeness of data points. It updates these measurements incrementally and stores
the clusters in a tree of Cluster Features (CF). One of BIRCH’s limitations is
that it treats all data points in the same way, without considering the time
at which they arrived. Hence, it cannot handle the evolution of data,
where a cluster can disappear, change, or merge with other clusters, which is one
of the inherent characteristics of a data stream. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
], an algorithm for clustering
data streams, known as TRAC-STREAMS, was presented. It relies on incremental
processing and robust statistics to learn an evolving model of the clusters in
a data stream. Its objective function consists of two terms: the first term
minimizes the scaled distances of the good points to the optimal cluster locations,
to achieve robustness, while the second term maximizes a soft estimate of the
cluster cardinality (the sum of weights), which ensures that as many good points as
possible are used in the estimation process. However, a constant factor is used
to balance the two terms; it has to be chosen carefully and depends on the
data’s dimensionality. DBSCAN [
        <xref ref-type="bibr" rid="ref9">9</xref>
] is a density-based clustering technique that
relies on labeling each data point as a core point, border point, or noise point,
based on the number of points within a predefined distance. This labeling relies
on two important parameters: MinPts (a threshold on the density around a core
point) and Eps (a measure of scale used as a threshold on the size of a dense
region). Its incremental version [
        <xref ref-type="bibr" rid="ref10">10</xref>
] (Inc-DBSCAN) achieves the same goal in an
online manner, by attempting to assign each new data point one of the above
labels and updating the clusters accordingly (merging, elimination, etc.). It was
proved that Incremental DBSCAN results in clusters that are equivalent to those of the
original DBSCAN; however, it has the same quadratic complexity in terms of the
data size.
      </p>
      <p>
        In this paper, we present a novel algorithm for extracting evolving clusters
from a massive data stream in one pass, while being able to resist and detect the
presence of outliers in the data stream. Our approach is rooted in robust statistics
[
        <xref ref-type="bibr" rid="ref7">7</xref>
] since it uses a robust estimation of the centroids (location) as well as a robust and
dynamic estimation of the cluster scales. It is also based on a robust,
distribution-independent statistical test (Chebyshev) for the detection of outliers and for the
merging of compatible clusters, thus assuring the robustness and compactness of the
extracted cluster model. The statistically based robustness to noise, the dynamic
estimation of scales, and the fast analytical optimization constitute the distinguishing
contributions of our approach compared to existing stream clustering methods.
The main distinction between RINO-STREAMS and TRAC-STREAMS is
the simpler objective function, which consists of only one term (the density) for
the former, thus eliminating the need to balance two terms in the latter. This
results in a simpler and faster optimization. Moreover, RINO-STREAMS has a
better cluster merging technique.
      </p>
      <p>
The rest of the paper is organized as follows. In Section 2, we describe the
proposed approach (RINO-STREAMS), while in Section 3, we present the
experimental results that validate our algorithm. Finally, we conclude in Section 4.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>The Proposed Approach: RINO-STREAMS</title>
      <p>
        The proposed algorithm, RINO-STREAMS, is based on incrementally updating
the clustering model using each newly arriving data point, thus maintaining
the model summaries (clusters) over time. The data stream X consists of a set
of data instances or points, indexed by the order of their arrival and
presented as $x_1, x_2, \ldots, x_N$, where N is the size of the data stream. Each
cluster i at time n (i.e. after receiving n points) is defined by its centroid $c_{i,n}$,
its scale $\sigma^2_{i,n}$ and its age $t_i$. The centroid represents the location of the cluster
center with respect to the other clusters at any time, while the scale represents the
influence zone around the centroid. The data points are weighted using a weight
function that decreases with the distance from the cluster prototype/centroid as
well as with the time at which the data arrived. Hence, newer data points have
more effect on the model than older ones, thus capturing the evolution of
clusters over time. Using an exponentially decaying forgetting factor was shown
to be successful in modeling evolution in previous work [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4,5,6</xref>
]. The
age is the difference between the current time/iteration or step (n) and the time
when the cluster was first created.
      </p>
      <p>Definition 1. Adaptive Robust Weight: For the ith cluster $C_i$, $i = 1, \ldots, K$,
the robust weight of the jth data point at time n (i.e. when the points $x_1, x_2, \ldots, x_n$
have been encountered so far) is defined as:
$$ w_{ij,n} = e^{-\left( \frac{d^2_{ij,n}}{2\sigma^2_{i,n}} + \frac{n-j}{\tau} \right)} \quad (1) $$
where $\tau$ is an optional application-dependent parameter that controls the
time decay rate of old data points (when needed) and how much importance is
given to newer data points; $d^2_{ij}$ is the Euclidean distance from the jth data point
to the ith cluster’s centroid $c_i$; and $\sigma^2_{i,n}$ is the scale of cluster i at time/step n,
which relates to the size of the influence zone of the cluster, where every point outside
this zone is considered an outlier with respect to cluster $C_i$. The second term of the
exponent represents a forgetting factor, whereby the weight of the data point j decreases
geometrically with the value of $n - j$; hence a new data point has a higher
weight, since its forgetting factor is close to zero, while an older point
has a larger value of the forgetting factor, which results in a smaller
weight. Assuming that the parameters of the model do not change significantly
with every new point, each old point’s weight, and thus its influence on the
cluster parameter estimates, decreases as follows:
$$ w_{ij,n} = e^{-\frac{1}{\tau}} \, w_{ij,n-1} \quad (2) $$
After encountering n data points, we search for the cluster centroids $c_{i,n}$ and
scales $\sigma^2_{i,n}$ that optimize the density objective function $\delta_{i,n}$:
$$ \max_{c_{i,n},\, \sigma^2_{i,n}} \left\{ \delta_{i,n} = \frac{\sum_{j=1}^{n} w_{ij,n}}{\sigma^2_{i,n}} \;\middle|\; i = 1, \ldots, K \right\} \quad (3) $$</p>
      <p>
        The robust weight $w_{ij,n}$ can also be considered as a typicality or possibilistic
degree of membership [
        <xref ref-type="bibr" rid="ref13">13</xref>
] of the point j in the cluster i after encountering n
points; hence the sum of the weights for each cluster represents a soft cardinality
of that cluster. A high cardinality is desirable because it means that the cluster
represents enough points to justify its existence, while a small scale means that it is a
good and compact cluster. The density of the cluster, $\delta_{i,n}$, combines these
two metrics, and hence it increases as the cardinality increases and as
the scale decreases. The advantage of optimizing the density, which combines
the two metrics, is that judging the quality of the cluster using only the sum of
weights (the numerator) is not enough, because a cluster with a large number of
similar points that are not confined within a small zone is not desirable from
the point of view of density.
      </p>
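      <p>For concreteness, the following minimal Python sketch (an illustration of equations (1) and (3) only, not the authors' implementation; all function and variable names are our own assumptions) computes the robust weight of a point with respect to one cluster and the resulting cluster density:</p>
      <preformat><![CDATA[
import numpy as np

def robust_weight(x, centroid, scale, n, j, tau):
    """Robust weight w_{ij,n} of point x_j at time n (equation (1)).

    scale is the cluster's sigma^2; tau is the optional forgetting
    lifetime (pass np.inf to disable forgetting)."""
    d2 = np.sum((np.asarray(x) - np.asarray(centroid)) ** 2)  # squared Euclidean distance
    forgetting = 0.0 if np.isinf(tau) else (n - j) / tau      # second term of the exponent
    return np.exp(-(d2 / (2.0 * scale) + forgetting))

def density(sum_of_weights, scale):
    """Cluster density delta_{i,n} = W_{i,n} / sigma^2_{i,n} (the quantity maximized in (3))."""
    return sum_of_weights / scale
]]></preformat>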
      <p>
        Optimizing the density objective function is done using alternating
optimization: the optimal centroid is found by fixing all the other variables, and
the same is then done when optimizing the scale. This optimization technique is also
used in the EM algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. Below, we present the optimal update theorems
for the cluster centroids and scales, while omitting the proofs due to the paucity
of space. The proofs are similar to the approach followed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Theorem 1. Optimal Incremental Centroid Update: Given the previous
centroids $c_{i,n-1}$, and assuming that the scales do not change much relative to
the scale that resulted from the previous point, the new centroid that optimizes
(3) after the nth point is given by:
$$ c_{i,n} = \frac{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1}\, x_j + w_{in,n}\, x_n}{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1} + w_{in,n}} \quad (4) $$</p>
      <p>The first term in the numerator (and denominator) represents the
previous knowledge about the location of the centroid, obtained from the points
($x_1, \ldots, x_{n-1}$); this term is multiplied by the forgetting factor to reduce
its effect on the new updated centroid and give more emphasis to the new data
point. This follows directly from the fact that the weight of each point is reduced
by $e^{-\frac{1}{\tau}}$, as given by equation (2). The second term represents the weighted effect
of the new data point on the location of the cluster.</p>
      <p>Theorem 2. Optimal Incremental Scale Update: Given the previous scales
$\sigma^2_{i,n-1}$, the new scale that optimizes (3) after the arrival of the nth point is given by:
$$ \sigma^2_{i,n} = \frac{e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1}\, d^2_{ij} + w_{in,n}\, d^2_{in}}{2 \left( e^{-\frac{1}{\tau}} \sum_{j=1}^{n-1} w_{ij,n-1} + w_{in,n} \right)} \quad (5) $$</p>
      <p>Similar to the centroid update equation, the first term in the numerator
represents the sum of the contributions of all the previous points ($x_1, \ldots, x_{n-1}$), and
this value is also penalized by the forgetting factor. The second term represents
the new information obtained from the new data point.</p>
      <p>Following these two update equations, the algorithm updates the cluster
parameters incrementally upon the arrival of each new data point, and it keeps
as a summary of the data stream only the previous centroids ($c_{i,n}$), scales ($\sigma^2_{i,n}$)
and sums of weights ($W_{i,n} = \sum_{j=1}^{n} w_{ij,n}$) for each cluster.</p>
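      <p>The summary-only bookkeeping implied by Theorems 1 and 2 can be sketched in Python as follows (a sketch under the stated update equations; the class layout, the numerical floor on the scale, and all names are our own assumptions, not the paper's code):</p>
      <preformat><![CDATA[
import numpy as np

class ClusterSummary:
    """Per-cluster summary: only weighted running sums are kept across the stream."""

    def __init__(self, x, initial_scale):
        self.centroid = np.asarray(x, dtype=float)
        self.scale = initial_scale           # sigma^2, starts at sigma_0^2
        self.sum_wx = self.centroid.copy()   # running sum of w_ij * x_j
        self.sum_w = 1.0                     # W_i: running sum of weights
        self.sum_wd2 = 0.0                   # running sum of w_ij * d_ij^2
        self.age = 0

    def update(self, x, w, tau):
        """Incremental updates of equations (2), (4) and (5)."""
        x = np.asarray(x, dtype=float)
        decay = np.exp(-1.0 / tau)                 # forgetting factor of (2)
        d2 = np.sum((x - self.centroid) ** 2)
        self.sum_wx = decay * self.sum_wx + w * x      # numerator of (4)
        self.sum_w = decay * self.sum_w + w            # denominator of (4)
        self.sum_wd2 = decay * self.sum_wd2 + w * d2   # numerator of (5)
        self.centroid = self.sum_wx / self.sum_w                    # equation (4)
        self.scale = max(self.sum_wd2 / (2.0 * self.sum_w), 1e-9)   # equation (5), floored
        self.age += 1

    @property
    def density(self):
        return self.sum_w / self.scale   # delta_i = W_i / sigma_i^2
]]></preformat>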
      <sec id="sec-1b-1">
        <title>Detecting Outliers</title>
        <p>An outlier is defined as a data point that does not belong to any of the existing
clusters (i.e. it lies outside their influence zones). If a point is determined to be an outlier
with respect to all the existing clusters, then a new cluster is created with the
point itself as its centroid. This newly created cluster is allowed a grace
period, $t_{mature}$; if after this threshold it is still weak (i.e. it has a density less
than a threshold $\delta_{min}$), then it is considered an outlying cluster and is
deleted. To detect outliers we use the Chebyshev bound, an upper
tail bound on the total probability that some random variable falls in
the tail of the distribution, i.e. far from the mean. One important property of
Chebyshev bounds is that they make no assumptions about the distribution
of the data; rather, they assume that a reliable scale estimate is available, which
is the case in RINO-STREAMS by virtue of the robust density and scale
optimization.</p>
        <p>Chebyshev Bounds: The Chebyshev bound for a random variable Y with
mean $\mu$ and standard deviation $\sigma$, for any real number $t &gt; 0$, is given by:
$$ \Pr\left\{ |Y - \mu| \geq t\sigma \right\} \leq \frac{1}{t^2} \quad (6) $$</p>
        <p>This bound allows us to design an outlyingness test for any new data point
with respect to cluster $C_i$, with significance probability $1/t^2$. The Chebyshev
inequality can be rearranged as follows:
$$ \Pr\left\{ e^{-\frac{|Y-\mu|^2}{2\sigma^2}} \leq e^{-\frac{t^2}{2}} \right\} \leq \frac{1}{t^2} \quad \text{or} \quad \Pr\left\{ w_{ij} \leq e^{-\frac{t^2}{2}} \right\} \leq \frac{1}{t^2} $$
which means that if the robust weight $w_{ij}$ of point j with respect to cluster
$C_i$ is less than the constant value $e^{-\frac{t^2}{2}}$, then point j is considered to be an
outlier with a significance probability of $\frac{1}{t^2}$. Moreover, the Chebyshev bound is
also used to compute the cardinality of a cluster during the evaluation (i.e.
all the points that pass the Chebyshev test are considered part of the cluster, with a
significance probability of $1 - \frac{1}{t^2}$).</p>
        <p>Definition 2. Outlier with Chebyshev probability of $\frac{1}{t^2}$: $x_j$ is an outlier
with respect to cluster $C_i$ at time n, with a significance probability of $\frac{1}{t^2}$, if:
$$ w_{ij,n} &lt; e^{-\frac{t^2}{2}} \quad (7) $$
We further use the Chebyshev bound to design a compatibility test for merging
clusters $C_i$ and $C_k$. This is done by checking their mutual Chebyshev bounds
with significance probability $1/t^2$: given the distance $d_{ik}$ between the centroids
$c_i$ and $c_k$, and using (6), the clusters are merged if:
$$ d^2_{ik} \leq t^2\, \sigma^2_{i,n} \quad \text{and} \quad d^2_{ik} \leq t^2\, \sigma^2_{k,n} \quad (9) $$
and the age of the merged cluster is $t_{new} = \max(t_i, t_k)$. Moreover, a test is performed to
eliminate any bad cluster whose density is less than a threshold ($\delta_{min}$) once it is mature
enough (i.e. $t_i &gt; t_{mature}$). Splitting clusters in RINO-STREAMS occurs naturally and does
not require any special treatment (see the experiments in Section 3.3); a cluster split
occurs when points from a cluster evolve in two or more different directions.</p>
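        <p>The outlier test of Definition 2 and the mutual compatibility test reduce to simple threshold checks. A Python sketch follows (our own illustration, with the merge test written out under the mutual-Chebyshev reading above; the function names are assumptions):</p>
        <preformat><![CDATA[
import numpy as np

def is_outlier(w_ij, t):
    """Definition 2: x_j is an outlier w.r.t. cluster i if w_ij < e^{-t^2/2}."""
    return w_ij < np.exp(-(t ** 2) / 2.0)

def chebyshev_compatible(c_i, c_k, scale_i, scale_k, t):
    """Mutual Chebyshev compatibility test for merging clusters i and k:
    each centroid must lie within t standard deviations of the other."""
    d2 = np.sum((np.asarray(c_i) - np.asarray(c_k)) ** 2)
    return d2 <= (t ** 2) * scale_i and d2 <= (t ** 2) * scale_k
]]></preformat>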
        <p>The complete steps of RINO-STREAMS are listed in Algorithm 1. The input
parameters of RINO-STREAMS include: the maximum number of clusters $K_{max}$,
an upper bound on the allowed number of clusters, which constrains the
maximal amount of memory devoted to storing the cluster model; the initial
scale $\sigma^2_0$, which is assigned to each newly created cluster; the density threshold
$\delta_{min}$, which is used to ensure that only good clusters with a high density are kept;
the maturity age $t_{mature}$, which provides newly created clusters with a grace
period before their density quality is tested; the forgetting factor $\tau$, which controls
the decay of the data point weights over time; and the Chebyshev bound constant
$t$, which is used in equations (7) and (9).</p>
        <p>For each new data point, RINO-STREAMS computes the distance and the
weight with respect to all the existing clusters, which is done in a linear number of steps.
Since the clustering model is updated incrementally, nothing is recomputed
from scratch, and hence the computational complexity of RINO-STREAMS is
$O(N K^2)$, where N is the size of the data stream and K is the number of clusters
discovered in the data set ($0 \leq K \leq K_{max}$), a very small value compared
to N. Moreover, the memory requirements of RINO-STREAMS are linear in
the number of clusters, because at any point in time only the cluster properties
($c_i, \sigma^2_i, W_i$) are kept, besides the new data point. This memory is constrained
by $K_{max} \cdot B_f \cdot (Dim + 3)$, where $B_f$ is the number of bytes needed to store
one floating point number, and $Dim$ is the dimensionality (the number of data
attributes), corresponding to the size of the centroid $c_i$; the number 3 refers to the
three remaining scalar values kept per cluster (the scale, the age, and the sum of weights).</p>
      </sec>
      <sec id="sec-1-1">
        <title>Algorithm 1 RINO-STREAMS</title>
        <p>Input: maximum number of clusters ($K_{max}$), initial scale ($\sigma^2_0$), density threshold ($\delta_{min}$),
maturity age ($t_{mature}$), forgetting factor ($\tau$), Chebyshev bound constant ($t$)</p>
        <p>Output: the cluster model after n points, $C_1 \cup C_2 \cup \ldots \cup C_K$, where
$C_i = (c_{i,n}, \sigma^2_{i,n}, t_i, W_{i,n})$ and $W_{i,n} = \sum_{j=1}^{n} w_{ij,n}$</p>
        <preformat>
FOR n = 1 TO N DO   // single pass over the data stream
    Compute the distances d_{in} and robust weights w_{in,n} between x_n
    and all clusters C_i, i = 1, ..., K
    IF K &lt; K_max AND x_n is an outlier with respect to all existing clusters
        // Create a new cluster centered on x_n
        K = K + 1
        c_K = x_n                   // centroid
        sigma^2_K = sigma^2_0       // initial scale
        t_K = 0                     // initial age
        W_K = 1                     // initial sum of robust weights
        delta_K = 1 / sigma^2_0     // initial density
    END IF
    FOR each cluster C_i, i = 1, ..., K
        // Update the compatible clusters if x_n is not an outlier
        IF x_n is NOT an outlier with respect to cluster i (see Def. 2)
            Update c_{i,n} using (4)
            Update sigma^2_{i,n} using (5)
            Update the sum of weights: W_{i,n} = e^{-1/tau} W_{i,n-1} + w_{in,n}
            Update the density: delta_{i,n} = W_{i,n} / sigma^2_{i,n}
            Update the age: t_i = t_i + 1
        END IF
    END FOR
    FOR each pair of clusters C_i and C_k, i, k = 1, ..., K
        IF C_i and C_k are Chebyshev-compatible using equation (9)
            Merge clusters C_i and C_k using equations (10) and (11)
        END IF
    END FOR
    FOR each cluster C_i, i = 1, ..., K
        IF t_i &gt; t_mature AND delta_i &lt; delta_min   // remove mature clusters with low density
            Remove C_i from the cluster model
        END IF
    END FOR
END FOR
        </preformat>
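        <p>Tying the pieces together, a compact one-pass Python sketch of Algorithm 1 is given below. It reuses robust_weight, is_outlier, chebyshev_compatible and ClusterSummary from the sketches above; the merge rule, which simply adds the two clusters' running sums, is our own simplification, since equations (10) and (11) are not reproduced above:</p>
        <preformat><![CDATA[
def rino_streams_sketch(stream, k_max, initial_scale, delta_min, t_mature, tau, t):
    """One pass over the stream, maintaining only per-cluster summaries."""
    clusters = []
    for n, x in enumerate(stream, start=1):
        weights = [robust_weight(x, c.centroid, c.scale, n, n, tau) for c in clusters]
        # Create a new cluster if x is an outlier w.r.t. every existing cluster
        if len(clusters) < k_max and all(is_outlier(w, t) for w in weights):
            clusters.append(ClusterSummary(x, initial_scale))
        # Update every existing cluster for which x is not an outlier
        for c, w in zip(clusters, weights):
            if not is_outlier(w, t):
                c.update(x, w, tau)
        # Merge Chebyshev-compatible pairs (simplified merge: add the summaries)
        kept = []
        for c in clusters:
            for m in kept:
                if chebyshev_compatible(c.centroid, m.centroid, c.scale, m.scale, t):
                    m.sum_wx += c.sum_wx
                    m.sum_w += c.sum_w
                    m.sum_wd2 += c.sum_wd2
                    m.centroid = m.sum_wx / m.sum_w
                    m.scale = max(m.sum_wd2 / (2.0 * m.sum_w), 1e-9)
                    m.age = max(m.age, c.age)
                    break
            else:
                kept.append(c)
        # Remove mature clusters whose density fell below the threshold
        clusters = [c for c in kept if not (c.age > t_mature and c.density < delta_min)]
    return clusters
]]></preformat>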
      </sec>
      <sec id="sec-1-2">
        <title>Dim is the dimensionality (number of data at</title>
        <p>tributes) and refers to the size of the centroid ci and the number 3 refers to the
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental Results</title>
      <p>RINO-STREAMS was evaluated using three synthetic datasets generated using
a random Gaussian generator: DS5 has five clusters, DS8 has eight clusters, and
DS16 has 16 clusters. For each dataset, nine stream variations were created by
adding different percentages of noise and by changing the order of data arrival.
We use a brief code to describe the experiments performed with these
dataset variations. The first part of the code is the name of the algorithm used;
the second part, Cx, gives the number of true clusters x; it is followed by the
order of arrival of the points (O: ordered one cluster at a time, followed by the
noise, if any; R2: random points from two clusters at a time, followed by noise;
R: completely random from all clusters and noise). The final part of the code
describes the percentage of noise added as Ny,
where y is the percentage of noise relative to all the data.</p>
      <sec id="sec-2-1">
        <title>Experimental Setup</title>
        <p>
          We compared RINO-STREAMS against two density-based online clustering
algorithms: TRAC-STREAMS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and Inc-DBSCAN [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. Since both
RINO-STREAMS and TRAC-STREAMS have similar parameters, we compared the
results of these two algorithms by changing the values of three parameters
while fixing the others. These parameters and their values are listed
in Table 1. The first parameter is the optional forgetting lifetime ($\tau$), to which we
assigned values given as a percentage of the data stream length ($|X|$), with infinity for
the case of no forgetting; the second parameter is the minimum sum of weights
($W_{min}$), which affects the minimum density threshold value ($\delta_{min} = W_{min} / \sigma^2_0$); and
finally the maximum number of clusters allowed ($K_{max}$), given as a percentage of the
real number of clusters in the ground truth ($K_G$), where $K_G = 5, 8,$ or $16$, depending
on the data set. Some non-critical parameters were fixed: the initial
scale value $\sigma^2_0 = 0.01$, the maturity age $t_{mature} = 100$, and the Chebyshev constant
$\frac{1}{t^2} = 0.075$. When comparing against Inc-DBSCAN, we picked the best results
for both methods and compared them using the different evaluation metrics discussed
below, because the two algorithms do not share similar parameters. To optimize the
performance of Inc-DBSCAN, we tried different values of MinPts from 1 to
10 and chose the best value; we then plotted the sorted k-dist graph for each
run and chose the best Eps value, as recommended in [
          <xref ref-type="bibr" rid="ref9">9</xref>
]. Table 2 shows Inc-DBSCAN's
optimal values of MinPts and Eps, found manually for the different
datasets. We chose Inc-DBSCAN because it is a leading density-based algorithm
that has been widely studied in the literature.
        </p>
        <p>
          To evaluate the results, we used four evaluation metrics. Two are internal
metrics: the Silhouette index [
          <xref ref-type="bibr" rid="ref12">12</xref>
] (a value close to 1 means that the data points were well
clustered) and the similarity matrix [
          <xref ref-type="bibr" rid="ref12">12</xref>
] containing the similarity values between
every two data points, after ordering the data by their cluster labels. The rest of
the metrics are external validation metrics: the error in the number of clusters detected,
and the average centroid error relative to the ground truth. A cluster is
considered to be correctly discovered if its centroid is close enough to one of the found
centroids, based on the Chebyshev test (7), and if the difference between the two
scales is less than a threshold value. In the case of Inc-DBSCAN, the centroid
and scale were calculated from the points themselves; because Inc-DBSCAN
does not have the notion of centroid and scale, we cannot compare these values
with the ground truth. Table 3 explains the x-axis index values for Figure 3. All
the experiments against TRAC-STREAMS showed that RINO-STREAMS
performed as well as or better than TRAC-STREAMS; hence, due to the paucity of
space, we only present the comparison against Inc-DBSCAN.</p>
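        <p>For reference, the internal validation can be reproduced with standard tooling; the snippet below (a toy illustration with synthetic blobs, not our experimental pipeline) computes the Silhouette index with scikit-learn:</p>
        <preformat><![CDATA[
import numpy as np
from sklearn.metrics import silhouette_score

# Two Gaussian blobs stand in for discovered clusters; in our setting the
# labels would come from assigning each point to the closest cluster that
# passes the Chebyshev test, with noise points excluded.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(1.0, 0.1, (100, 2))])
labels = np.repeat([0, 1], 100)
print(silhouette_score(X, labels))  # close to 1 => well-separated, compact clusters
]]></preformat>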
        <p>In this section, we compare the clustering quality of the proposed algorithm
with that of Inc-DBSCAN. The optimal Inc-DBSCAN parameter values are
listed in Table 2. For the proposed algorithm, we set the reasonable values
$K_{max} = 200\%\, K_G$, $\tau = 40\%\, |X|$ and $W_{min} = 20$. Figure 1 shows the clustering
output for the experiment C16RN18: figure 1(a) shows the ground truth, while
figures 1(b) and 1(c) show the clustering output using Inc-DBSCAN and
RINO-STREAMS, respectively. Inc-DBSCAN falsely detected more clusters, because it
depends on the notion of density only, so outliers that are close to each other are
considered valid clusters, whereas RINO-STREAMS uses a robust estimation of
scale that makes it more resistant to outliers, and it also uses quality tests
(using $\delta_{min}$) to ensure that only high-quality clusters are maintained. Figure 2 shows
the similarity matrices of the clustering outputs for Inc-DBSCAN and
RINO-STREAMS, which serve to validate both cluster outputs; however, Inc-DBSCAN
tends to fragment some clusters (e.g. the blocks toward the bottom right corner).
Figure 3 shows the results for DS16, where the x-axis corresponds to the different
dataset variations (i.e. different noise percentages and orders of data arrival),
as explained in Table 3. DBSCAN always overestimates the actual number of
clusters, and it labels small groups of noise as clusters, thus finding more false
clusters as the noise increases, whereas RINO-STREAMS correctly detected the
right number of clusters in most cases. The Silhouette index for RINO-STREAMS
is better than that of Inc-DBSCAN in all the configurations, which
means a higher-quality clustering.
        </p>
        <p>Fig. 1: Clustering output for Experiment C16RN18: (a) Ground Truth, (b) Inc-DBSCAN, (c) RINO-STREAMS (noise points detected and removed). Note 1: contours denote the Chebyshev bound resulting from the estimated scales; only the points that were not labeled as noise are shown.</p>
        <p>Fig. 2: DS16: Similarity Matrix for Experiment C16RN18.</p>
        <p>Fig. 3: DS16: RINO vs. DBSCAN (see Table 3); the absence of a bar means no error.</p>
        <p>
To illustrate how clusters merge and split in RINO-STREAMS, we designed
two experiments: one where a single cluster evolves into three different clusters, to show
cluster splitting, and one where three clusters evolve into one cluster, to show
cluster merging. Figure 4 shows the evolution of the cluster output at five different time
periods, where time is measured in terms of the number of data points that have
arrived relative to the data stream size ($|X|$). It can be seen that one cluster
(cluster number 1) is detected at the beginning, and then, as the cluster splits,
two more clusters are detected. Figure 5 illustrates the gradual merging of three
different clusters, over five different time periods, into one cluster.</p>
        <p>Note: in Figures 4 and 5, points in turquoise are old points (their time of arrival exceeds $\tau$).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We presented RINO-STREAMS, a novel algorithm for mining evolving clusters
from a dynamic data stream in one pass, while being able to resist and detect
the presence of outliers in the data stream. Our approach is rooted in robust
statistics since it uses a robust estimation of centroids (location) as well as a
robust and dynamic estimation of the cluster scales. It is also based on a
robust, distribution-independent statistical test (Chebyshev) for the detection of
outliers and for the merging of compatible clusters, thus ensuring the robustness
and compactness of the extracted evolving cluster model. The statistically based
robustness, dynamic estimation of scales, and fast analytical optimization
distinguish our approach from existing stream clustering methods. Our experiments
validated the robustness properties of the proposed algorithm and the accuracy
of its clustering model obtained in a single pass compared to two competing
density-based clustering algorithms, TRAC-STREAMS and Inc-DBSCAN.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This work was supported by US National Science Foundation Grant IIS-0916489.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Barbara</surname>
          </string-name>
          ,
          <article-title>Requirements for clustering data streams</article-title>
          ,
          <source>ACM SIGKDD Explorations Newsletter</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>2</issue>
          , pp.
<fpage>23</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>O'Callaghan</surname>
          </string-name>
          ,
          <article-title>Clustering data streams</article-title>
          ,
          <source>in IEEE Symposium on Foundations of Computer Science (FOCS'00)</source>
          , Redondo Beach, CA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Livny</surname>
          </string-name>
          ,
          <article-title>BIRCH: An efficient data clustering method for large databases</article-title>
          ,
          <source>in ACM SIGMOD International Conference on Management of Data</source>
          , New York, NY,
          <year>1996</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          , ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Density-based clustering over an evolving data stream with noise</article-title>
          , in
          <source>2006 SIAM Conference on Data Mining</source>
          ,
          <year>2006</year>
          ,
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rojas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>TECNO-STREAMS: Tracking evolving clusters in noisy data streams with a scalable immune system learning model</article-title>
          ,
          <source>in Third IEEE International Conference on Data Mining (ICDM'03)</source>
          , Melbourne, FL,
          <year>November 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Density-based clustering for real-time stream data</article-title>
          , in
          <source>KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , ACM
          ,
          <year>2007</year>
          ,
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Huber</surname>
          </string-name>
          , Robust Statistics, John Wiley &amp; Sons, New York,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nasraoui</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rojas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Robust Clustering for Tracking Noisy Evolving Data Streams</article-title>
          ,
          <source>SDM</source>
          ,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          , in
          <source>Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)</source>
          , AAAI Press,
          <year>1996</year>
          ,
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>Incremental Clustering for Mining in a Data Warehousing Environment</article-title>
          , in
          <source>Proc. 24th Int. Conf. Very Large Data Bases (VLDB)</source>
          ,
          <year>1998</year>
          ,
          <fpage>323</fpage>
          -
          <lpage>333</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Borman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>The expectation maximization algorithm: A short tutorial</article-title>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , Introduction to Data Mining, Addison Wesley,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Keller</surname>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>The possibilistic c-means algorithm: insights and recommendations</article-title>
          .
          <source>IEEE Transactions on Fuzzy Systems</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <fpage>385</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>