<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey: Running and Comparing Data Stream Clustering Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rowanda D. Ahmed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gökhan Dalkılıç</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murat Erten</string-name>
          <email>muraterten@iyte.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Department</institution>
          ,
          <addr-line>İzmir</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Izmir Institute of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, clustering data streams has become an incredibly important research area for knowledge discovery, as applications produce ever more unstoppable streaming data. In this paper we introduce clustering, streams, and data stream clustering algorithms, and discuss the most important stream clustering algorithms with respect to their structure. As an additional contribution, and differently from review and survey papers in stream clustering, we offer a practical treatment of the best-known stream clustering algorithms, namely: (i) CluStream; (ii) DenStream; (iii) D-Stream; and (iv) ClusTree, showing their experimental results along with the computation of several performance metrics for each, using the MOA framework.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Clustering is one of the most fundamental and crucial tasks in the data mining field, and it has passed through many stages of development [<xref ref-type="bibr" rid="ref1">1</xref>], especially in recent years. The ever wider applicability of streaming data leads to large amounts of data growing at an unbounded rate. We therefore need non-stop strategies for analyzing and clustering the continuously arriving online stream objects, so that there is always an up-to-date view of the objects seen and clustered so far. Stream data clustering faces additional challenges beyond those of static data clustering. Some of the challenges streams bring are as follows. Single-pass processing: the endless stream must be processed in one pass while maintaining all the summary information needed by the clustering process. Limited time: the time available for processing each record must be small, which keeps the clustering model current and keeps up with the stream speed. Limited memory: processing in the stream must be done without storing, buffering, or revisiting the data. Varying time allowances: the usual case for stream data is a bursty arrival pattern. Continuously evolving data [<xref ref-type="bibr" rid="ref2">2</xref>] is ubiquitous in today's applications; clusters found at some point in time may no longer be relevant in the future [<xref ref-type="bibr" rid="ref3">3</xref>], so many interesting changes and characteristics of an evolving data stream should be recorded for the effective use of many business applications [<xref ref-type="bibr" rid="ref4">4</xref>]. Concept drift, novelty, the number and size of clusters, and outlier detection should be considered as well, in addition to scalability, which is considered the most important issue in stream clustering when the data sets are very large [<xref ref-type="bibr" rid="ref5">5</xref>]. Finally, data stream applications may need results at any time, so the model must be available at any time.
      </p>
      <p>Performance metrics. Performance metrics determine how well the obtained clustering reflects the data. Cluster evaluation methods can be classified into two kinds: extrinsic methods and intrinsic methods.</p>
      <p>
        Extrinsic methods (supervised). These apply when ground truths are available, so a score can be assigned to the clustering by comparing it with the ground truth. Purity, precision, and recall are examples of extrinsic metrics. Precision and recall measure the success of a prediction, especially in the case of highly imbalanced classes. In information retrieval, precision measures the relevancy of the output, while recall measures how many of the truly relevant results are returned. The F1-score is the harmonic mean of recall and precision; it reaches its best value at 1 and its worst at 0. Purity measures the extent to which clusters contain a single class [<xref ref-type="bibr" rid="ref6">6</xref>]; the higher the purity, the better, although a purity of 1 can trivially be obtained by putting each data point in its own cluster. SSQ is the sum of squared distances from all data points to their corresponding cluster centers; the smaller the SSQ, the better. Homogeneity requires that each cluster contains only members of a single class, while completeness requires that all members of a class are assigned to the same cluster. Both are bounded between 0 and 1, and higher is better for both.
      </p>
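      <p>The extrinsic metrics above can be sketched in a few lines of plain Python; this is an illustration of the definitions, not the implementation used by any of the frameworks discussed later.</p>

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of points whose cluster's majority class equals their true class."""
    clusters = {}
    for cluster, label in zip(cluster_labels, true_labels):
        clusters.setdefault(cluster, []).append(label)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters.values())
    return majority / len(true_labels)

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

      <p>For example, with true classes [0, 0, 0, 1, 1, 1] and cluster labels a, a, b, b, b, b, cluster a is pure but cluster b mixes one 0 among three 1s, so the purity is 5/6.</p>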
      <p>Intrinsic methods (unsupervised). These apply when ground truths are unavailable. They evaluate how compact the clusters are and how well separated they are; in other words, intrinsic methods measure the goodness of the clustering itself. The silhouette coefficient is an example of an intrinsic method: for each point, it compares how well the point fits its assigned cluster against how well it would fit the nearest other cluster. The silhouette coefficient combines cohesion and separation measures and is bounded between -1 and 1; the higher the silhouette coefficient, the better. This survey is organized as follows: in Section 2 we study a variety of stream clustering methods in general. In Sections 3, 4, 5, and 6 we study the stream clustering methods CluStream, DenStream, D-Stream, and ClusTree, respectively. In Section 7 we compare the above-mentioned algorithms by running them. Finally, Section 8 concludes.</p>
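      <p>A direct (quadratic-time) computation of the silhouette coefficient from its definition might look as follows; this is an illustrative sketch, not the implementation used by the MOA framework.</p>

```python
import math

def silhouette(points, labels):
    """Mean silhouette s = (b - a) / max(a, b) over all points, where a is the
    mean distance to the point's own cluster (cohesion) and b is the smallest
    mean distance to any other cluster (separation)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:                       # singleton cluster: define s = 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
                / labels.count(other)
                for other in set(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

      <p>Two well-separated clusters score close to 1, while assigning each pair of distant points to the same cluster drives the score negative.</p>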
    </sec>
    <sec id="sec-2">
      <title>Stream Clustering Algorithms</title>
      <p>The old, unsophisticated data stream algorithms suffer from many problems. One naive solution is to buffer the stream for later processing, but this is impossible because the stream is endless. Some approaches lose valuable information by dropping data to keep up with the stream speed. Other approaches try to adapt to the stream speed by scaling only for the fastest stream speed, which results in poor-quality clustering. Clearly, none of these solutions makes the best use of the information contained in the stream or of the time available. Each of the current stream clustering algorithms tries to solve some of the above-mentioned problems.</p>
      <p>
        Stream clustering can be categorized into three main categories: prototype-based methods, density-based methods, and model-based methods [<xref ref-type="bibr" rid="ref7">7</xref>]. An example of the prototype-based algorithms is K-Means [<xref ref-type="bibr" rid="ref8">8</xref>]: it aims to partition the objects of the dataset into k clusters so that each object belongs to the cluster with the closest mean, and then refines these clusters iteratively. Density-based algorithms such as DBSCAN [<xref ref-type="bibr" rid="ref9">9</xref>] group points that lie close together, considering dense, concentrated areas of data points; these dense areas form the final clusters, while the low-density areas are marked as outliers. Expectation Maximization (EM) [<xref ref-type="bibr" rid="ref10">10</xref>] is an example of the model-based algorithms. In [<xref ref-type="bibr" rid="ref11">11</xref>], many clustering methods are discussed that are designed for static rather than stream data. Different clustering paradigms have studied data stream clustering. Earlier paradigms like [<xref ref-type="bibr" rid="ref12">12</xref>] treat data stream clustering as a continuous version of static clustering; they are single-phase, divide-and-conquer schemes that split the stream into segments and rely on a k-means algorithm to discover clusters. Such approaches do not consider evolving data, because they treat recent and outdated data the same. To solve the evolving-data problem, moving-window techniques were proposed [<xref ref-type="bibr" rid="ref13">13</xref>]. CluStream [<xref ref-type="bibr" rid="ref14">14</xref>] uses a two-phase strategy: in the online phase it analyzes the incoming stream data and stores its summary statistics as micro-clusters, while in the offline phase it uses these statistics along with other parameters to generate the final clusters. Wang et al. [<xref ref-type="bibr" rid="ref15">15</xref>] cluster data streams with a two-tier structure, an online-offline framework based on CluStream but with an improved offline phase. DenStream [<xref ref-type="bibr" rid="ref16">16</xref>] presents core and outlier micro-clusters as structures to maintain and summarize clusters and to distinguish potential clusters from outliers. CobWeb [<xref ref-type="bibr" rid="ref17">17</xref>] is an incremental system for hierarchical conceptual clustering that organizes observations in a classification tree. A density-based method like D-Stream [<xref ref-type="bibr" rid="ref18">18</xref>] is a natural and attractive basic data stream clustering strategy, because it can find arbitrarily shaped, interwoven clusters and can detect and handle noise. BIRCH [<xref ref-type="bibr" rid="ref19">19</xref>] maintains a hierarchical index for faster access in very large databases. A mapping to frequent item sets for multidimensional streams is presented in [<xref ref-type="bibr" rid="ref20">20</xref>]. A kernel-based approach [<xref ref-type="bibr" rid="ref21">21</xref>] is proposed to discover arbitrarily shaped clusters in stream data. An alternative graph-based approach is proposed in [<xref ref-type="bibr" rid="ref22">22</xref>], a fractal-dimension grid-based approach is proposed in [<xref ref-type="bibr" rid="ref23">23</xref>], and [<xref ref-type="bibr" rid="ref16">16</xref>], [<xref ref-type="bibr" rid="ref24">24</xref>] propose density-based approaches. None of the above approaches can adapt the clustering model size to keep up with the online stream speed, nor deliver a result or allow the process to be interrupted at any moment in time. The anytime idea is a very active research field in data mining, with several algorithms in that area. ClusTree [<xref ref-type="bibr" rid="ref25">25</xref>] is a parameter-free anytime algorithm: it adapts itself to the stream speed automatically and can also detect novelty, outliers, and concept drift in the stream. To maintain stream summaries, it uses a compact, self-adaptive index structure.
      </p>
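      <p>As a concrete reference point for the prototype-based family, a minimal (non-streaming) K-Means in the spirit of [8] can be sketched as follows; parameter names and defaults are illustrative.</p>

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's iteration: assign every point to its nearest center, then move
    each center to the mean of its assigned points, and repeat."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its old center unchanged.
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers
```

      <p>On two well-separated blobs the centers converge to the blob means, which is the iterative refinement described above.</p>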
    </sec>
    <sec id="sec-3">
      <title>CluStream Algorithm</title>
      <p>CluStream is an online-offline algorithm that adopts the idea of streaming over many time windows; by doing so, it gains more understanding of what is going on in the clustering process and of how the clusters behave and evolve. The CluStream algorithm stores data summary statistics in its online component and uses these statistics, along with other parameters, in the offline component to achieve the final clustering results. CluStream uses the micro-cluster concept as well as a pyramidal time frame.</p>
      <sec id="sec-3-1">
        <title>CluStream Clustering Framework</title>
        <p>
          To understand the CluStream framework, it is important to clarify some points. First, the summary statistics in the online component are incrementally updated to support the offline clustering. Second, the time interval between the snapshots at which the summary statistics are stored must be determined. Third, the online information helps the user set the time horizon and deal with evolution. CluStream stores the online statistics in the form of micro-clusters, using the cluster feature vector [<xref ref-type="bibr" rid="ref19">19</xref>] for its natural additive and subtractive properties. The algorithm stores these micro-clusters at time periods following a pyramidal pattern, which provides a good and reasonable trade-off between the storage requirements and the ability to retrieve the summary statistics stored in the micro-clusters over various time horizons. Definition 1 in [<xref ref-type="bibr" rid="ref14">14</xref>] facilitates understanding of the above-mentioned concepts.
        </p>
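        <p>The additive and subtractive behavior of the feature vector can be illustrated with a minimal sketch. Only the spatial fields (the count N, the linear sum LS, and the squared sum SS of the points) are shown; the temporal fields of the full CluStream definition are omitted for brevity.</p>

```python
class ClusterFeature:
    """(N, LS, SS): point count, per-dimension linear sum, per-dimension
    squared sum. Two features add (or subtract) component-wise, which is what
    lets CluStream recover the micro-clusters of a time horizon by
    subtracting an older snapshot from a newer one."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, [0.0] * dim, [0.0] * dim

    def insert(self, point):
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]
        self.ss = [s + x * x for s, x in zip(self.ss, point)]

    def center(self):
        return [s / self.n for s in self.ls]

    def add(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def subtract(self, other):
        self.n -= other.n
        self.ls = [a - b for a, b in zip(self.ls, other.ls)]
        self.ss = [a - b for a, b in zip(self.ss, other.ss)]
```

        <p>Adding a feature summarizes the union of both point sets; subtracting it restores the summary of the difference, without touching the raw points.</p>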
      </sec>
      <sec id="sec-3-2">
        <title>Online phase (Micro-Cluster Maintenance)</title>
        <p>Initially, CluStream uses many points to produce the initial micro-clusters by running k-means offline. After the creation of these initial micro-clusters, CluStream starts the online process: with every arriving data point, it updates the previously stored micro-clusters. The new data point is either absorbed into the nearest micro-cluster or it creates a new micro-cluster of its own. When a new point arrives, the algorithm computes the distance between the new point and the nearest micro-cluster center. If the decision is to merge, the algorithm merges the arriving data point with that closest micro-cluster. Otherwise, if the new point is not found to be within the range of any micro-cluster with regard to its radius parameter, the new data point is treated either as an outlier or as the seed of a new micro-cluster.</p>
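        <p>The absorb-or-create decision can be sketched as follows. This is a simplification: the real algorithm derives the boundary from the micro-cluster's RMS deviation and, when memory is full, deletes or merges old micro-clusters, which is omitted here.</p>

```python
import math

def online_step(micro_clusters, point, max_boundary):
    """Absorb the point into the nearest micro-cluster if it lies within the
    boundary; otherwise seed a new micro-cluster from the point alone.
    A micro-cluster is kept here as {'n': count, 'ls': linear sum}."""
    def center(mc):
        return [s / mc['n'] for s in mc['ls']]
    if micro_clusters:
        nearest = min(micro_clusters,
                      key=lambda mc: math.dist(point, center(mc)))
        if math.dist(point, center(nearest)) <= max_boundary:
            nearest['n'] += 1
            nearest['ls'] = [s + x for s, x in zip(nearest['ls'], point)]
            return
    micro_clusters.append({'n': 1, 'ls': list(point)})
```

        <p>A nearby point updates the existing summary, while a distant point starts a new micro-cluster, matching the two outcomes described above.</p>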
      </sec>
      <sec id="sec-3-3">
        <title>Offline phase - Macro-Cluster Creating</title>
        <p>
          Micro-clusters are efficiently maintained as intermediate statistical representations, so that the stored micro-cluster summary statistics can be used instead of the very large volume of the data stream itself. This process enables the user to flexibly explore stream clusters over different horizons. The user provides two parameters: the first is the time horizon h, which determines the history over which the higher-level clusters are created; the second is the number k of higher-level clusters, which determines whether more detailed clusters can be found. Note that at each step of the algorithm, the current set of micro-clusters depends on the entire stream processing history since its very beginning. To find the clusters over a specific time horizon, we find the corresponding micro-clusters by making use of the additive and subtractive properties extended from [<xref ref-type="bibr" rid="ref19">19</xref>].
        </p>
        <sec id="sec-3-3-1">
          <title>DenStream Algorithm</title>
          <p>In data stream clustering, maintaining all the data is impossible. Therefore, it is important to design algorithms with single-pass clustering, without requiring the parameters to be known, and capable of handling evolving data. The micro-cluster technique in stream clustering saves the necessary information about the stream's data objects, which compresses the data effectively. The following subsections cover four well-known density-based data stream algorithms that extend the micro-clustering technique: DenStream, C-DenStream, rDenStream, and SDStream. We start by illustrating the DenStream algorithm in more detail, because the others are based on it.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Density Micro-Clustering Algorithms</title>
        <p>
          Adopting the density concept helps discover arbitrarily shaped clusters by distinguishing dense areas from scattered sparse areas. Due to memory limits, the micro-cluster is a well-known strategy for exactly this purpose. Micro-clusters are an effective representation for the summary statistics stored in the online component, to be used afterward by the offline clustering component. Fig. 1 shows the framework of micro-clusters in density-based clustering. The p-micro-clusters and o-micro-clusters [<xref ref-type="bibr" rid="ref16">16</xref>] are the potential and outlier clusters, which differ mainly in their weights. In this section, we briefly present four density-based micro-clustering [<xref ref-type="bibr" rid="ref19">19</xref>] stream algorithms, DenStream, SDStream, rDenStream, and C-DenStream, with their pros and cons, so that the user can choose a suitable algorithm according to his own preferences, such as speed or higher accuracy.
        </p>
        <p>
          The DenStream algorithm assigns a weight to each data point. It is a fast algorithm because it can delete outdated data points and outliers before merging, and because it does not merge data points into a micro-cluster prematurely, which facilitates exploring the outliers and saves time. The SDStream algorithm uses only the most recent data stream objects, while other algorithms consider the data stream points as a whole. It stores the micro-clusters in the EHCF data structure. SDStream therefore saves memory and can process only the most recent data points over the sliding-window length, which means it addresses cluster changes and evolution better. This algorithm is suitable when the application is more interested in the recent data stream. The rDenStream algorithm is based mainly on DenStream; however, rDenStream emphasizes handling outliers: it chooses not to remove the data in the outlier buffer but to relearn from it, which results in a higher accuracy compared to the DenStream algorithm, at the cost of being slower and using more memory. C-DenStream is an algorithm applicable to real settings. It adopts the constraint concept on micro-clusters and uses background knowledge to guide the clustering process. In this algorithm, clusters cannot form unless they conform to application semantics such as natural geographical borders and objects. D-Stream is a density-based framework for clustering stream data. It is an online-offline algorithm: the online component continuously reads input data records and maps them into a density grid, while the offline component computes the grid density, detects and removes sporadic grids from the grid list, and adjusts the clustering. The D-Stream algorithm exploits the intricate relationships between the decay factor, data density, and cluster structure [<xref ref-type="bibr" rid="ref18">18</xref>]. In addition, it captures dynamic changes by applying a density-decaying technique to the data stream, generating clusters of arbitrary shapes effectively and efficiently.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>D-Stream Components</title>
        <p>
          Grid inspection and gap determination. The D-Stream algorithm progressively adjusts the clustering once every gap time units. But what is the optimal length of the grid-inspection interval? Note that if this time interval is chosen too long, the dynamic changes of the data stream will not be recognized well. On the other hand, if it is chosen too small, it results in too much computation, which makes the workload heavy. D-Stream sets the time interval to the smaller of the two time intervals derived in [<xref ref-type="bibr" rid="ref18">18</xref>].
        </p>
        <p>Sporadic grid removal. The very large number of grids is a critical challenge the D-Stream algorithm faces. But most grids are either empty or only rarely receive data records. So, for processing and storage, only the non-empty grids are considered, and the others are neglected. Sporadic grids that receive very little data can be deleted from the data space along with their characteristic vectors, and their density reset to zero. It has been experimentally shown that deleting this kind of grid does not affect the clustering quality. However, grids with many data records whose densities have fallen below the threshold only through the effect of the decay factor still have a chance to upgrade and become dense or transitional grids, so it would be wrong to delete them.</p>
        <p>
          D-Stream clustering algorithm. The algorithm keeps reading from the flowing stream of data records and computes all grid densities. The initial clustering procedure generates the initial clusters. After the first gap, the adjust-clustering procedure is executed repeatedly, once every gap. The initial-clustering and adjust-clustering procedures are described in detail, supported by pseudo-code, in [<xref ref-type="bibr" rid="ref27">27</xref>].
        </p>
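        <p>A lazy density update in the spirit of D-Stream's characteristic vectors might look like this; the decay factor value is an assumption chosen for illustration, not a value prescribed by the paper.</p>

```python
LAMBDA = 0.998  # decay factor in (0, 1); an assumed illustrative value

def map_record(grid, cell, t):
    """Decay the cell's stored density by lambda^(t - last_update), then add 1
    for the newly mapped record. Densities are updated lazily, on access."""
    density, last = grid.get(cell, (0.0, t))
    grid[cell] = (density * LAMBDA ** (t - last) + 1.0, t)

def current_density(grid, cell, t):
    """Density of a cell at time t, applying any pending decay."""
    density, last = grid.get(cell, (0.0, t))
    return density * LAMBDA ** (t - last)
```

        <p>A cell whose current density falls below the sporadic threshold becomes a deletion candidate, unless its density is low only because of decay, as discussed above.</p>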
        <sec id="sec-3-5-1">
          <title>ClusTree</title>
          <p>ClusTree is a parameter-free, anytime data stream algorithm. It addresses many challenges, such as limited memory and time, maintaining results at any point in time, and adapting the clustering process to the stream speed. In addition, ClusTree uses novel descent strategies to handle slow streams and aggregation mechanisms to handle fast streams. It also uses an exponential decay function to give more importance to the more recent data points. It is a self-adaptive data stream clustering algorithm and scales to give anytime clustering results. ClusTree's efficiency and effectiveness have been validated experimentally.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>Self adaptive and anytime data stream clustering</title>
        <p>ClusTree is an index-structure-based stream clustering algorithm; it can store and maintain a compact view of the stream at any time the user requests it. It was the first stream clustering algorithm with the anytime property. ClusTree is self-adaptive, so it can automatically adapt to the speed of the incoming data point stream.</p>
        <p>
          Micro-clusters and anytime insert. The ClusTree algorithm periodically computes the mean and variance of its micro-clusters. As stream data points flow in, the cluster feature is updated incrementally, so it can be considered a faithful representation of the micro-clusters, and stream data points can be mapped to the most similar micro-cluster. ClusTree builds micro-clusters hierarchically, with a different granularity at each level [<xref ref-type="bibr" rid="ref25">25</xref>]. To do so, it adopts hierarchical indexing structures from the R-tree family [<xref ref-type="bibr" rid="ref17">17</xref>], [<xref ref-type="bibr" rid="ref28">28</xref>], [<xref ref-type="bibr" rid="ref29">29</xref>], which both preserve the cluster features and efficiently locate the right place to insert an incoming stream object into a suitable micro-cluster. To determine the leaf entry containing the micro-cluster most similar to the newly arriving object, the algorithm descends the tree until it reaches that leaf. If the similarity is sufficient, the algorithm updates the micro-cluster tuple values; otherwise, it creates a new micro-cluster and forms its cluster feature (CF) from the object's values. ClusTree has a buffer as a temporary place for storing objects or aggregates that, because of a fast stream, cannot reach the leaf level during insertion: when the insertion is interrupted and there is not enough time to descend the tree down to the leaf, the cluster feature is stored in the buffer of a suitable entry.
        </p>
        <p>Up-to-date clustering maintenance. To give the most recent objects more importance, the ClusTree algorithm uses the well-known exponential decay function ω(∆t) = β^(−λ·∆t), where λ is the decay rate that controls the objects' weights: the algorithm can control how objects are forgotten or maintained by tuning the decay rate. As the decay rate increases, old objects are forgotten faster. Every object carries its arrival and last-update timestamps, which are needed to compute the decay function. During insertion, while descending the tree down to the leaf, all entries of a visited node are updated to the arrival timestamp, so the entries of a node always share the same timestamp. Every inner node in the tree structure summarizes its subtree, so consulting the last-update timestamp field in every node, together with the weighting formula, ensures that decay over time is captured correctly. To avoid splitting, the algorithm also weighs entries by time.</p>
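        <p>The decay function can be written down directly; the base β = 2 is a commonly used choice, and both values here are illustrative assumptions rather than prescribed settings.</p>

```python
BETA = 2.0  # decay base; beta = 2 is a commonly used choice

def weight(decay_rate, dt):
    """omega(dt) = beta ** (-decay_rate * dt): a just-arrived object has
    weight 1, and with beta = 2 the weight halves every 1/decay_rate
    time units, so a larger decay rate forgets old objects faster."""
    return BETA ** (-decay_rate * dt)
```
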
      </sec>
      <sec id="sec-3-7">
        <title>Cluster shapes and cluster transitions</title>
        <p>
          Several strategies, described in detail in [<xref ref-type="bibr" rid="ref30">30</xref>], can process cluster transitions such as outlier detection, concept drift detection, and novelty. ClusTree can apply any of these approaches to its output clusters. Thus, the ClusTree algorithm can be used to detect arbitrarily shaped clusters, outliers, concept drift, and novelty, over a chosen time horizon.
        </p>
        <sec id="sec-3-7-1">
          <title>Experimental Results</title>
          <p>
            For the experimental results, we adopt the Massive Online Analysis (MOA) framework [<xref ref-type="bibr" rid="ref31">31</xref>] for online learning from data streams. MOA is closely related to WEKA, implemented in Java, and released under the GPL. It includes a collection of online and offline components, in addition to many tools for evaluating classification and clustering. It is easy to design and run experiments on MOA, and it is easy to extend it too. The experiments were run on a Lenovo machine with an Intel Core(TM) i7-4510U CPU @ 2.60 GHz and 6.00 GB of RAM. Fig. 2 shows the results of running the four stream clustering algorithms, CluStream, DenStream, D-Stream, and ClusTree, on synthetic data generated with a Radial Basis Function (RBF) generator. Synthetic data is easier to reproduce, and its storage and transmission costs are small [<xref ref-type="bibr" rid="ref32">32</xref>].
          </p>
          <p>Fig. 3 shows that ClusTree achieves virtually higher quality than the D-Stream algorithm, and is in the same range as CluStream. ClusTree is also slightly better than DenStream in SSQ, which is also clear from the results listed in Table 1 below.</p>
          <p>The separation (BSS) is the between-cluster sum of squared distances. The higher the BSS, the better the clustering quality. Fig. 4 (left) shows how much better CluStream is than ClusTree: CluStream's BSS is two to five times that of ClusTree.</p>
          <p>Fig. 4. BSS for 50,000 points. ClusTree in red; CluStream (left), DenStream (middle), and D-Stream (right) in blue.</p>
          <p>Fig. 4 (middle) compares ClusTree and DenStream. DenStream is much better than ClusTree, by roughly an order of magnitude.</p>
          <p>In terms of the BSS metric, ClusTree is better only than D-Stream, as shown clearly in Fig. 4 (right) and in Table 1. The BSS values when clustering 50,000 stream data points are 72.62, 217.82, 6.49, and 22.93 for CluStream, DenStream, D-Stream, and ClusTree, respectively. Note that D-Stream's BSS value is the smallest, so it is the worst algorithm in terms of the BSS performance metric. The introduction indicated how, with large amounts of data growing at an unbounded rate, the need for non-stop strategies for analyzing and clustering the continuously arriving stream objects has become so important, especially in recent years. Accordingly, a wide set of techniques and strategies has been proposed for analyzing and clustering streams. This paper also discussed the additional challenges of stream data clustering and explained the most important performance metrics used to compare the quality of clustering methods. Finally, experimental results of these streaming methods were presented.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , M. Alhanjouri,:
          <article-title>New Density-Based Clustering Technique: GMDBSCAN-UR</article-title>
          , Islamic University of Gaza,
          <source>International Journal of Advanced Research in Computer Science, No. 1</source>
          , Jan-Feb
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Liu</surname>
            <given-names>Huachun</given-names>
          </string-name>
          , Xiangning Hou and Zhong Yang,:
          <article-title>Design of Intrusion Detection System Based on Improved K-means Algorithm</article-title>
          [J],
          <source>Computer Technology and Development</source>
          ,
          <year>2016</year>
          (
          <volume>01</volume>
          ):
          <fpage>101</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          :
          <article-title>A Survey of Stream Clustering Algorithms</article-title>
          , Chapter 10, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          :
          <article-title>A Framework for Diagnosing Changes in Evolving Data Streams</article-title>
          .
          <source>ACM SIGMOD Conference</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Farnstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          .:
          <article-title>Scalability for Clustering Algorithms Revisited</article-title>
          .
          <source>ACM SIGKDD Explorations</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ): pp.
          <fpage>51</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2000</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manning</surname>
            <given-names>Christopher D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raghavan</surname>
            <given-names>Prabhakar</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schütze</surname>
            <given-names>Hinrich</given-names>
          </string-name>
          :
          <article-title>Introduction to Information Retrieval</article-title>
          .
          <source>Cambridge University Press</source>
          . ISBN 978-0-521-86571-5.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Al Abd Alazeez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jassim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Du</surname>
          </string-name>
          ,:
          <article-title>EINCKM: An Enhanced Prototype-based Method for Clustering Evolving Data Streams in Big Data</article-title>
          ,
          <source>Proc. 6th Int. Conf. Pattern Recognition Applications and Methods (ICPRAM)</source>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>MacQueen</surname>
          </string-name>
          :
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          ,
          <source>5th Berkeley Symp. Math. Stat. Probab.</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>233</issue>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,:
          <article-title>A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise</article-title>
          ,
          <source>2nd Int. Conf. Knowledge Discovery and Data Mining</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Dempster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Laird</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          :
          <article-title>Maximum Likelihood from Incomplete Data via the EM Algorithm</article-title>
          ,
          <source>J. R. Stat. Soc. Ser. B (Methodological)</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jain</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>EY</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>: Adaptive non-linear clustering in data streams</article-title>
          ,
          <source>CIKM</source>
          , pp
          <fpage>122</fpage>
          -
          <lpage>131</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meyerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>O'Callaghan</surname>
          </string-name>
          .:
          <article-title>Clustering data streams: Theory and practice</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng.</source>
          ,
          <volume>15</volume>
          (
          <issue>3</issue>
          ):
          <fpage>515</fpage>
          -
          <lpage>528</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>D.</given-names>
            <surname>Barbará</surname>
          </string-name>
          :
          <article-title>Requirements for clustering data streams</article-title>
          .
          <source>SIGKDD Explorations Newsletter</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Aggarwal</surname>
            <given-names>CC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>PS</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>: A framework for clustering evolving data streams</article-title>
          .
          <source>VLDB</source>
          , Berlin, pp
          <fpage>81</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          .:
          <article-title>Clustering Data streams on the Two-tier structure</article-title>
          .
          <source>Advanced Web Technologies and Applications</source>
          , pages
          <fpage>416</fpage>
          -
          <lpage>425</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>F.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .:
          <article-title>Density-based clustering over an evolving data stream with noise</article-title>
          ,
          <source>in SIAM Conference on Data Mining</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>V.</given-names>
            <surname>Kanageswari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dr.A.</given-names>
            <surname>Pethalakshmi</surname>
          </string-name>
          :
          <article-title>A Novel Approach of Clustering Using COBWEB</article-title>
          , Department of Computer Science, M.V. Muthiah Government Arts College for Women, Dindigul, Tamil Nadu, India.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tu</surname>
          </string-name>
          .:
          <article-title>Density-based clustering for real time stream data</article-title>
          ,
          <source>ACM KDD Conference</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zhang</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Livny</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>: BIRCH: an efficient data clustering method for very large databases</article-title>
          . SIGMOD, NY, USA.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Assent</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krieger</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glavic</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seidl</surname>
            <given-names>T</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>: Clustering multidimensional sequences in spatial and temporal databases</article-title>
          .
          <source>Knowl Inf Syst</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>29</fpage>
          -
          <lpage>51</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Jain</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>EY</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>: Adaptive non-linear clustering in data streams</article-title>
          ,
          <source>CIKM</source>
          , pp
          <fpage>122</fpage>
          -
          <lpage>131</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Lühr</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazarescu</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>: Incremental clustering of dynamic data streams using connectivity based representative points</article-title>
          .
          <source>Data Knowl Eng</source>
          <volume>68</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Barbará</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>P</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>: Using the fractal dimension to cluster datasets</article-title>
          ,
          <source>KDD</source>
          , pp
          <fpage>260</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Chen</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tu</surname>
            <given-names>L</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>: Density-based clustering for real-time stream data</article-title>
          ,
          <source>KDD</source>
          , pp
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>P.</given-names>
            <surname>Kranen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baldauf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Assent</surname>
          </string-name>
          ,
          <article-title>The ClusTree: indexing micro-clusters for anytime stream mining</article-title>
          , RWTH Aachen University, Aachen, Germany, Aarhus University, Aarhus,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Tasoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>Visualising the cluster structure of data streams</article-title>
          , in
          <source>Proceedings of the 7th International Conference on Intelligent Data Analysis (IDA'07)</source>
          . Berlin, Heidelberg: Springer-Verlag,
          <year>2007</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>David</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergei</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <article-title>: k-means++: The Advantages of Careful Seeding</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Jin</surname>
          </string-name>
          .:
          <article-title>Tracking clusters in evolving data streams over sliding windows</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>214</lpage>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Guttman</surname>
            <given-names>A</given-names>
          </string-name>
          (
          <year>1984</year>
          )
          <article-title>R-trees: A Dynamic Index Structure for Spatial Searching</article-title>
          . SIGMOD, Boston, pp
          <fpage>47</fpage>
          -
          <lpage>57</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Spiliopoulou</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ntoutsi</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theodoridis</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schult</surname>
            <given-names>R</given-names>
          </string-name>
          (
          <year>2006</year>
          )
          <article-title>Monic: Modeling and Monitoring Cluster Transitions</article-title>
          , KDD, pp
          <fpage>706</fpage>
          -
          <lpage>711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kirkby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          (
          <year>2010</year>
          ),
          <article-title>MOA: Massive Online Analysis</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>11</volume>
          :
          <fpage>1601</fpage>
          -
          <lpage>1604</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kirkby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          :
          <article-title>Data Stream Mining: A Practical Approach</article-title>
          , May
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>