<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Grace – Limiting the Number of Grid Cells for Clustering High-Dimensional Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Beer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniyal Kazempour</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian Busch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Tekles</string-name>
          <email>alexander.tekles@campus.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Seidl</string-name>
          <email>seidl@dbs.ifi.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig-Maximilians-Universität München</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Using grid-based clustering algorithms on high-dimensional data has the advantage of being able to summarize data points into cells, but usually produces an exponential number of grid cells. In this paper we introduce Grace (using a Grid which is adaptive for clustering), a clustering algorithm which limits the number of cells produced depending on the number of points in the dataset. A non-equidistant grid is constructed based on the distribution of points in one-dimensional projections of the data. A density threshold is automatically deduced from the data and used to detect dense cells, which are later combined into clusters. The adaptive grid structure makes an efficient but still accurate clustering of multidimensional data possible. Experiments with synthetic as well as real-world data sets of various size and dimensionality confirm these properties.</p>
      </abstract>
      <kwd-group>
        <kwd>Grid-based</kwd>
        <kwd>Clustering</kwd>
        <kwd>High-dimensional</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Clustering is one of the most important and well investigated unsupervised data
mining tasks. Nevertheless, some problems related to the curse of dimensionality
are still not solved. Grid-based approaches suffer not only from the exponentially
increasing number of cells in relation to the number of dimensions, but also from
the incoherence between data and grid structure. As many real-world datasets
have high-dimensional feature spaces, being able to handle many dimensions is
quite important for clustering algorithms.</p>
      <p>Even though subspace clustering algorithms focus on high-dimensional data,
they assume clusters to be in a low-dimensional subspace of the data and are thus
not suitable to find clusters lying in the full-dimensional space. Density-based
approaches on the other hand find clusters in full-dimensional space where all
dimensions are equally important, but cannot handle high-dimensional data. Thus,
efficient clustering of high-dimensional data without any a priori knowledge is
still a huge challenge.</p>
      <p>Hence we developed a grid-based clustering approach for high-dimensional
data. It works fully automatically and finds clusters in the full-dimensional data
space without creating an exponential number of cells. The grid adapts itself to
the data with regard to cell size as well as the individual number of cells per dimension
by using information from one-dimensional projections of the data. This leads
not only to an adequate quality of the clustering results, but at the same time
facilitates high efficiency of the algorithm.</p>
      <p>Our main contributions are as follows:
– We develop Grace, a new grid-based clustering algorithm.
– By constructing the adaptive grid gradually, we are, to the best of our
knowledge, the first to circumvent an exponential number of grid cells in
relation to the number of dimensions.
– Grace is efficient and detects clusters of arbitrary shape in high-dimensional
space.</p>
      <p>The rest of the paper is structured as follows. Section 2 provides an overview
of related work in the field of density-based and grid-based clustering. The
algorithm itself is described in Section 3 and evaluated theoretically as well as
empirically in Section 4. A brief conclusion is finally provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Since there exists a wealth of literature in the field of density-based as well as
grid-based clustering, this section aims to provide an overview of some of the
existing methods. We shall provide a brief elaboration on the core ideas behind
each of the methods, revealing the distinctive properties of our method in contrast
to the competitors.</p>
      <sec id="sec-2-1">
        <title>Density-based Clustering Approaches</title>
        <p>
          Density-based clustering methods detect dense regions which are enclosed and
separated by sparse regions and are thus suitable to find arbitrarily shaped
clusters. Most density-based methods rely on local densities based on distances
in the full-dimensional data space. The most common density-based approach
is DBSCAN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which considers points with at least minPts points in their
ε-range as core points. Core points are connected if their distance is lower than
ε and form a cluster. Non-core points either lie in the ε-range of a core point and
get assigned to the cluster of the core point, or are declared noise. DBSCAN is
quite sensitive to ε and minPts, two parameters that are hard to guess for a
user without detailed knowledge of the data. OPTICS [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] improves DBSCAN
by introducing a reachability plot based on minPts, in which users can see the
cluster structure and choose an appropriate ε.
        </p>
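        <p>As a minimal illustration of this core-point criterion, consider the following sketch (assuming a small NumPy dataset; eps and min_pts stand in for ε and minPts, and whether a point counts itself varies between formulations):</p>
        <preformat>
import numpy as np

def core_points(X, eps, min_pts):
    """Boolean mask of DBSCAN core points: points with at least
    min_pts points (here including the point itself) in their
    eps-range. O(N^2) memory, so only suitable for small N."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (dists &lt;= eps).sum(axis=1) >= min_pts
        </preformat>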
        <p>
          DENCLUE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] uses local densities to compute an overall density function,
the maxima of which constitute density attractors. Every object is connected
to such a density attractor by means of a hill-climbing procedure. A threshold
gives the minimum density level for a density attractor, which allows finding
noise. To accelerate the calculation of local densities, a simple grid with the same
cell width 2σ in all dimensions is used, where σ is a user-given parameter.
        </p>
        <p>
          Other density-based clustering algorithms usually build on the approaches
presented so far and aim to improve or extend them. HDBSCAN [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for
example extends DBSCAN to a hierarchical approach allowing different levels of
density that can detect clusters of different density or nested clusters,
overcoming the aforementioned issue of one global hyperparameter setting. DeLiClu [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
and SOPTICS [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] represent algorithms with purposes similar to OPTICS, but
with improvements regarding their efficiency. Likewise, DENCLUE 2.0 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a
straightforward improvement of DENCLUE with reduced runtime complexity.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Grid-based and Subspace Clustering Approaches</title>
        <p>
          Grid-based clustering methods generally partition the data into cells of different
densities by dividing each dimension into several intervals. Those cells can then
be connected into clusters without having to look at each data point again, which
decreases the runtime. The results are often highly dependent on the structure
of the constructed grid, and most algorithms require users to set the defining
parameters. STING [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] proposes a quite interesting hierarchical grid structure
based on statistical information, but is neither used for clustering, nor does it
deliver exact values, but rather approximations. Also, the distribution type of
the data has to be known or ascertained by hypothesis tests.
        </p>
        <p>Most grid-based clustering techniques produce subspace clusterings, i.e., they
detect subspaces of a high-dimensional data space which contain clusters of the
given data. Grid-based approaches are well-suited for this task because they can
easily exploit the monotonicity of the clustering criterion regarding
dimensionality. This criterion implies that a k-dimensional cell is dense only if every
(k − 1)-dimensional projection of the cell is also dense, given a constant density
threshold for the number of objects in a cell.</p>
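        <p>This anti-monotone property yields a simple pruning test; the following small sketch illustrates it (the encoding of cells as dimension-interval pairs is a hypothetical choice made for this example):</p>
        <preformat>
def may_be_dense(candidate, dense_lower):
    """Anti-monotone pruning: a k-dimensional cell can only be dense
    if every (k-1)-dimensional projection of it is dense.
    `candidate` maps dimension -> interval index, e.g. {0: 3, 2: 1};
    `dense_lower` holds the dense (k-1)-dimensional cells, each
    encoded as a frozenset of (dimension, interval) pairs."""
    for leave_out in candidate:
        projection = frozenset(
            (dim, iv) for dim, iv in candidate.items() if dim != leave_out)
        if projection not in dense_lower:
            return False
    return True
        </preformat>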
        <p>
          One of the first clustering algorithms to implement a grid-based subspace
clustering was CLIQUE [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. After constructing a grid with the same number of
equidistant intervals in each dimension and identifying the dense cells, CLIQUE
employs a bottom-up approach to find subspaces with dense regions by joining
cells in k-dimensional spaces to candidate cells in (k + 1) dimensions. If the
number of objects in such a candidate cell exceeds a given density threshold, the
corresponding (k + 1)-dimensional space is considered a relevant subspace. After
detecting the relevant subspaces, CLIQUE connects adjacent dense cells in their
corresponding subspaces.
        </p>
        <p>However, CLIQUE does not take the data distribution into account for
generating the grid structure or for finding the dense regions. One subspace clustering
approach that considers the data distribution beforehand is FIRES, which
detects clusters on the basis of a greedy heuristic merging one-dimensional clusters
in order to find approximations of subspace clusters. This significantly reduces
the runtime complexity of FIRES compared to CLIQUE.</p>
        <p>
          While FIRES does not employ a grid structure at all, MAFIA [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
incorporates the data distribution by using an adaptive grid in order to produce a better
partitioning of the dimensions and reduce the number of grid cells. MAFIA
determines the intervals in each dimension on the basis of one-dimensional
histograms. Adjacent bins of a histogram are joined if they have approximately the
same frequency. This yields larger intervals in the dimensions, each with roughly
constant (one-dimensional) density. Nevertheless, MAFIA requires two
parameters that may have a significant impact on the results, and there is no guaranteed
bound on the number of grid cells. On its adaptive grid, MAFIA proceeds like
CLIQUE.
        </p>
        <p>
          A general framework for clustering high-dimensional data on the basis of
an adaptive grid is provided by OptiGrid [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] which recursively splits the data
set by means of separating hyperplanes which should cut through low-density
regions and separate high-density regions. The resulting cells already represent
clusters, given a sufficiently high density. Though different approaches exist for
selecting suitable hyperplanes [
          <xref ref-type="bibr" rid="ref10 ref6">10, 6</xref>
          ], these methods are not able to detect
arbitrarily shaped clusters, since the generated cells already represent clusters and are
not combined. Further, these methods require setting parameters whose impact
on the result is difficult to assess a priori.
        </p>
        <p>
          Further grid-based methods include SCHISM [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] which addresses the
question of how to define and detect statistically interesting subspaces in
high-dimensional data. As a measure of interestingness, the authors rely on the
Chernoff-Hoeffding bound and use it for pruning. WaveCluster [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] relies on
discrete wavelet transformation. The data is mapped to the frequency domain,
where clusters are then found by detecting dense regions. The method is
insensitive to outliers and has a runtime complexity linear in the number of data
objects. Among the most recent grid-based methods, ITGC [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is an
information-theoretic approach regarding clustering as a data compression task. As such,
neighboring grid cells are merged if doing so is beneficial with respect to compression
costs.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>E cient Grid-based Clustering of Multi-dimensional</title>
    </sec>
    <sec id="sec-4">
      <title>Spaces</title>
      <p>In this section we describe Grace in detail. In Section 3.1 we explain how the
non-equidistant regular grid is generated depending on the respective dataset. Section
3.2 shows how dense grid cells are combined to form clusters.</p>
      <sec id="sec-4-1">
        <title>Generation of Adaptive Grids</title>
        <p>The grid generation process is designed to limit the number of generated cells
depending on the number of data points N and still allow for an accurate
detection of clusters. For that we first estimate the density of each dimension
separately and then split the data space iteratively based on these estimations
until the number of cells exceeds N log(N). To reduce the runtime for
calculating local changes in one-dimensional densities, we consider a histogram with
b = max(50, √N/d) equi-width bins for every dimension instead of all points
separately. A minimum of 50 bins for each histogram has been shown to produce
appropriate grid structures for various data distributions and different numbers
of points N. Higher N implies possibly more complex cluster shapes, requiring
a higher granularity of the histogram. A higher number of dimensions d, in
contrast, results in a lower number of bins, since high dimensionality implies less
expressiveness of distances, and fewer bins allow for higher deviations.</p>
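        <p>A minimal sketch of this histogram construction (assuming a NumPy data matrix and leaving the exact rounding of b as an implementation choice):</p>
        <preformat>
import numpy as np

def build_histograms(X):
    """One equi-width histogram per dimension with
    b = max(50, sqrt(N)/d) bins, as described above."""
    N, d = X.shape
    b = int(max(50, np.sqrt(N) / d))
    hists, edges = [], []
    for dim in range(d):
        h, e = np.histogram(X[:, dim], bins=b)
        hists.append(h)
        edges.append(e)
    return hists, edges
        </preformat>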
      </sec>
      <sec id="sec-4-2">
        <title>Estimation of Local Changes in One-Dimensional Densities</title>
        <p>[Fig. 1: Two-dimensional data with histograms as approximations of the one-dimensional data projections and the edges chosen accordingly]</p>
        <p>Next, we compute for each bin in all one-dimensional histograms a local change
indicator to express the local change of the one-dimensional density. To
this end, we measure local changes in density as differences of frequencies
between the areas left and right of a particular bin and additionally set
them in relation to the frequencies in their respective areas to distinguish
random variations from relevant density shifts. The relevant neighborhood
left and right of a particular bin is determined dynamically based on the
frequency f covered by this area. The idea is to add fewer bins if the local
density is already high, such that the separating hyperplanes will lie
closer together. Adjacent bins left and right are added iteratively until f
exceeds a threshold t which is adjusted after each step. The threshold
primarily depends on f and N such that a certain fraction of all objects needs to lie
within the neighborhood range to stop expanding it. To avoid building large
low-density cells, the neighborhood frequency is further weighted with the width
of the current neighborhood range h, leading to a threshold
t = 1/(h · (1/b) · (N/d)) · N. (1)</p>
        <p>These weighted frequencies are used both for computing the differences
between the left and right areas as well as for determining the neighborhood area. As a
consequence, the density close to a bin has more impact on these computations
than more distant bins. Figure 1 illustrates the generation of an adaptive grid
based on the one-dimensional data projections as described so far.</p>
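        <p>The following sketch shows one plausible reading of this neighborhood expansion, assuming bins of normalized width 1/b and the threshold of Eq. (1) as reconstructed above; it is an illustration, not the reference implementation:</p>
        <preformat>
def expand_neighborhood(hist, i, N, d, step):
    """Grow the neighborhood on one side of bin i (step = -1 for
    left, +1 for right) by whole bins until its frequency f exceeds
    the adaptive threshold t of Eq. (1)."""
    b = len(hist)
    f, h, j = 0.0, 0.0, i + step
    while 0 &lt;= j &lt; b:
        f += hist[j]                        # frequency covered so far
        h += 1.0 / b                        # width of the neighborhood
        t = N / (h * (1.0 / b) * (N / d))   # Eq. (1), as reconstructed
        if f > t:
            break
        j += step
    return f, h

def local_change(hist, i, N, d):
    """Relative difference of the width-weighted frequencies left
    and right of bin i; large values indicate a density shift."""
    fl, hl = expand_neighborhood(hist, i, N, d, -1)
    fr, hr = expand_neighborhood(hist, i, N, d, +1)
    wl = fl / hl if hl > 0 else 0.0
    wr = fr / hr if hr > 0 else 0.0
    return abs(wl - wr) / max(wl, wr, 1e-12)
        </preformat>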
        <p>Selection of Intervals. After the one-dimensional histograms are computed and
a change indicator is assigned to each bin, separating hyperplanes are selected
iteratively until the number of cells generated by these planes is greater than or
equal to N log(N). If d is high, not all dimensions are considered, in order to limit
the number of cells, i.e., some dimensions are not split by a separating plane. A
dimension is only considered if at least two edges are chosen for this dimension.
With only one edge in a dimension, i.e., two intervals, all data points would lie in
the same interval or in adjacent intervals with respect to this dimension. In both
cases, the corresponding dimension would have no informative content. This grid
generation method yields at most 3 · N log(N) cells (see Theorem 1).</p>
        <p>Theorem 1. Given a d-dimensional data set with N objects. A regular grid in
the d-dimensional space that is constructed by iteratively adding cutting
hyperplanes consists of at most 3N cells if the grid generation process is stopped as
soon as the number of cells is greater than or equal to N.</p>
        <p>Proof. Suppose h_dim edges are already chosen for dimension dim ∈ {1, ..., d}. If
h_k ≤ 1, dimension k is not yet considered, due to the minimum of two edges in
a dimension for it to be considered. Let D_c ⊆ {1, ..., d} be the set of indices of
the dimensions already considered. Dimension k is divided into b_k = (h_k + 1)
intervals. The number of cells c in the grid is computed by multiplying the
number of intervals in all considered dimensions:
c = ∏_{k ∈ D_c} b_k.
Before an edge is added, the process has not yet stopped, so c &lt; N. Suppose the
new edge is added in dimension k, yielding c' cells.</p>
        <p>Case 1: h_k ≥ 2</p>
        <p>If h_k ≥ 2, dimension k is already considered, and the new edge increases its
number of intervals from b_k to b_k + 1. The number of cells thus grows by the
factor (b_k + 1)/b_k ≤ 4/3, since b_k ≥ 3, so that c' ≤ (4/3) · c &lt; 3N.</p>
        <p>Case 2: h_k = 0</p>
        <p>If h_k = 0, the number of cells does not increase at all, as dimension k is not
considered yet and will not be after this iteration. For the relationship between
c and c' it holds that c' = c &lt; N &lt; 3N.</p>
        <p>Case 3: h_k = 1</p>
        <p>If h_k = 1, k is a new dimension to be considered, as h'_k = h_k + 1 = 2 and
b'_k = b_k + 1 = 3, respectively. The number of cells therefore increases by the
factor 3: h'_k = 2 ⇒ k ∈ D_c and
c' = ∏_{j ∈ D_c} b'_j = 3 · c &lt; 3N. □</p>
        <p>Every bin of the initial histogram represents a potential edge for splitting.
In every iteration, the bin with the maximum local change indicator is chosen
as the next edge. To choose edges closer to cluster borders, we make a small
adjustment after selecting an edge: the edge is shifted in one direction as long
as two successive bins have a frequency difference below 5%, and we choose the
direction in which the edge needs to be shifted less. After selecting an edge,
we discard all bins within the neighborhood of the selected bin from the set of
potential edges. This step avoids a high granularity of the grid in areas of density
changes. Algorithm 1 summarizes the grid creation.</p>
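        <p>A condensed sketch of this selection loop (the edge shifting is omitted, and bins discarded from the candidate set are assumed to be marked with a score of negative infinity):</p>
        <preformat>
import math

def select_edges(indicators, N):
    """Iteratively turn the bin with the largest local change
    indicator into a grid edge until the grid implied by the chosen
    edges has at least N*log(N) cells. `indicators` holds one list
    of per-bin scores for each dimension and is modified in place."""
    edges = [[] for _ in indicators]     # chosen edge bins per dimension
    limit = N * math.log(N)

    def n_cells():
        cells = 1
        for e in edges:
            if len(e) >= 2:              # a dimension only counts once
                cells *= len(e) + 1      # it has at least two edges
        return cells

    while n_cells() &lt; limit:
        # bin with the globally largest local change indicator
        dim, bin_ = max(
            ((k, i) for k, scores in enumerate(indicators)
                    for i in range(len(scores))),
            key=lambda ki: indicators[ki[0]][ki[1]])
        if indicators[dim][bin_] == float("-inf"):
            break                        # no candidate edges left
        edges[dim].append(bin_)
        indicators[dim][bin_] = float("-inf")   # each bin is used once
    return edges
        </preformat>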
        <p>[Algorithm 1: CreateGrid]</p>
        <p>Given the generated grid, the next steps involve the detection of dense grid cells
and the subsequent combination of adjacent dense cells. For Grace, we apply a wider
notion of adjacency than existing works: coordinates may differ by at most
one in all dimensions, compared to a narrow notion, where the coordinates of
adjacent cells may differ by one in only exactly one dimension.</p>
        <p>A cell of volume V is considered dense if it contains more than minPts/V
points, for a minPts given by the user. To avoid setting the density threshold too
high for clusters with lower density, we first determine the most dense cells. To
this end, we identify all cells containing more points than they would in
expectation assuming a uniform distribution. In a second step, we discard these cells
and detect the remaining dense cells with lower density using a new threshold.</p>
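        <p>One plausible reading of this two-stage test, sketched over hypothetical arrays of per-cell point counts and volumes (the exact form of the second threshold is not specified above, so the uniform expectation is simply recomputed on the remaining cells):</p>
        <preformat>
import numpy as np

def dense_cells(counts, volumes):
    """Two-stage density test: first mark cells holding more points
    than a uniform distribution would place in them, then discard
    those and repeat the test on the remaining cells with a
    correspondingly lower expectation."""
    counts = np.asarray(counts, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    expected = counts.sum() * volumes / volumes.sum()
    stage1 = counts > expected           # the most dense cells
    rest = ~stage1
    stage2 = np.zeros_like(stage1)
    if rest.any():
        expected2 = counts[rest].sum() * volumes[rest] / volumes[rest].sum()
        stage2[rest] = counts[rest] > expected2
    return stage1 | stage2
        </preformat>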
        <p>Detection of dense cells is straightforward. After discretizing all data points
to the grid, the number of data points in each grid cell is counted and compared
to the threshold of the particular grid cell. Given an adequate grid structure,
the number of dense cells p is usually much lower than N. Adjacent dense cells
are now connected to form clusters by extracting connected components from
the graph represented by the symmetric p × p adjacency matrix M with M_{i,j} = 1
if and only if the coordinates of cells i and j differ by one in exactly one
dimension. Adjacent cells can be found by sorting the dense cells in every
dimension and then iterating over the dimensions.</p>
        <p>So far, only adjacent cells in the narrower sense have been connected. To connect
cells adjacent in the wider sense, i.e., diagonally adjacent cells, we add additional
helper cells next to dense cells. Given a cell ã that is either a dense cell or a
previously added helper cell with coordinates (a_1, a_2, ..., a_d), and the order of
dimensions considered for the current sorting of the coordinates d_{i_1}, d_{i_2}, ..., d_{i_d},
a new helper cell (b_1, b_2, ..., b_d) with b_k = a_k ∀k ∈ {i_1, ..., i_{d−1}} and
b_{i_d} = a_{i_d} + 1 is now added to the set of helper cells if it has not been added
before. New helper cells are not considered in the same iteration they were added in, but in
subsequent iterations they are treated just like the original dense cells. This
ensures finding exactly all connections between dense cells that are originally adjacent in
the wider sense.</p>
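        <p>The per-dimension sorting for the narrow notion of adjacency can be sketched as follows (a (p, d) integer array of cell coordinates is assumed; connected components over the returned pairs, e.g. via union-find, then yield the clusters, and helper cells extend the same mechanism to the wider notion):</p>
        <preformat>
import numpy as np

def narrow_adjacencies(cells):
    """Find all pairs of dense cells that are adjacent in the narrow
    sense (equal in all dimensions, differing by one in exactly one)
    using one sort per dimension. `cells`: (p, d) integer array."""
    p, d = cells.shape
    pairs = set()
    for dim in range(d):
        others = [k for k in range(d) if k != dim]
        # least significant key first: all other coordinates take
        # precedence over the current dimension
        order = np.lexsort((cells[:, dim], *(cells[:, k] for k in others)))
        for a, b in zip(order[:-1], order[1:]):
            if (cells[a, others] == cells[b, others]).all() and \
                    abs(int(cells[a, dim]) - int(cells[b, dim])) == 1:
                pairs.add((min(a, b), max(a, b)))
    return pairs
        </preformat>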
        <p>Theorem 2. Given a d-dimensional grid with a set P of dense cells and an
initially empty set of helper cells H, both of which are iterated over d times. In every
iteration i, a cell b with coordinates b_k = a_k ∀k ∈ {1, ..., d} \ {i} and b_i = a_i + 1,
for each cell a ∈ P ∪ H that has no adjacent cell with the coordinates of b, is added to the set
of helper cells after the current iteration. With this procedure, two dense cells
that are adjacent in the wider sense can be connected, either directly or indirectly
with the help of other cells in P ∪ H, by just applying the notion of adjacency in
the narrow sense.</p>
        <p>Proof. Given two dense cells a and b with coordinates (a_1, a_2, ..., a_d) and (b_1, b_2,
..., b_d), respectively.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Case 1: a and b are adjacent in the narrow sense</title>
        <p>Adjacency in the narrow sense implies adjacency in the wider sense by
definition, thus the two cells are also adjacent in the wider sense.</p>
        <p>Case 2: a_i ∈ {b_i, b_i − 1} ∀i ∈ {1, ..., d}</p>
        <p>Suppose the cells' coordinates differ in dimensions {j_1, j_2, ..., j_m} ⊆ {1, 2, ..., d}
with j_s &lt; j_t ∀s &lt; t. In iteration j_1, the cell a^(1) with coordinates (a_1, ..., a_{j_1} +
1, ..., a_d) is either added as a helper cell or is already a dense cell. Since this cell
differs from a by one in exactly one dimension, it is adjacent to a in the narrow
sense. In iteration j_2, the cell a^(2) with coordinates (a_1, ..., a_{j_1} + 1, ..., a_{j_2} +
1, ..., a_d) is again either added as a helper cell or is already a dense cell. The new cell
is now adjacent to the previously added cell a^(1). This step is then repeated for
all j ∈ {j_1, j_2, ..., j_{m−1}}. Finally, the cell a^(m−1) with coordinates (a_1, ..., a_{j_1} +
1, ..., a_{j_2} + 1, ..., a_{j_{m−1}} + 1, ..., a_d) is either added as a helper cell or is already a dense
cell. a^(m−1) differs in exactly one dimension from b, so that a^(m−1) and b are
adjacent in the narrow sense, and thus a and b are connected.</p>
        <p>Case 3: a_i ∈ {b_i, b_i + 1} ∀i ∈ {1, ..., d}</p>
        <p>Switching a and b converts this case to the same as case 2.</p>
        <p>Case 4: a_i ∈ {b_i, b_i − 1} ∀i ∈ I ⊆ {1, ..., d} and a_j ∈ {b_j, b_j + 1} ∀j ∈ J =
{j_1, j_2, ..., j_m} ⊆ {1, ..., d} \ I</p>
        <p>In this case, two chains of adjacent cells can be constructed, each following
the same idea as in case 2 or in case 3, respectively. The first chain starts
from cell a, considering all dimensions with a_i = b_i − 1, which corresponds
to case 2. The second chain starts from cell b, considering all dimensions with
a_j = b_j + 1, which corresponds to case 3. Finally, without loss of generality, the
coordinates of the last cell ã in the first chain are of the form (ã_1, ã_2, ..., ã_d)
with ã_i = a_i + 1 = b_i ∀i ∈ I and ã_k = a_k ∀k ∈ {1, ..., d} \ I. The coordinates
of the last cell b̃ in the second chain are of the form (b̃_1, b̃_2, ..., b̃_d) with b̃_j =
b_j + 1 = a_j ∀j ∈ {j_1, ..., j_{m−1}}, b̃_{j_m} = a_{j_m} − 1 and b̃_k = b_k ∀k ∈ {1, ..., d} \ J.
Now, these two last cells of both chains differ only in dimension j_m by one, so
that they are connected in iteration j_m. □</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>We investigate Grace from a theoretical as well as from an empirical point of
view. In Section 4.1 we calculate the runtime complexity, and in Section 4.2 we present
and discuss several experiments based on synthetic as well as real-world data
sets and compare the results with DBSCAN and CLIQUE.</p>
      <sec id="sec-5-1">
        <title>Complexity Analysis</title>
        <p>The histograms for each dimension can be computed in time O(N · d). For each of
the b · d histogram bins, the local change indicator can be computed in constant
time, leading to O(b · d). Having b ≤ √N · d, we get O(√N · d · d) = O(√N · d²).
Finding separating hyperplanes requires sorting the b · d local change indicators,
which can be done in time O(b · d · log(b · d)) = O(√N · d · d · log(√N · d · d)) =
O(N · d²).</p>
        <p>The dense cells are determined by counting for every point the occurrences of
each coordinate combination, which is in O(N · d). Dense cells can be combined
efficiently by sorting them in every dimension to identify adjacent grid cells in
that dimension. Since the maximum number of dense cells is N log(N), the
complexity of sorting the dense cells is O(N log(N) · log(N log(N)) · d) =
O(N · log²(N) · d). Adjacent dense cells can be identified in each dimension by
comparing consecutive cells in the sorted order with complexity O(N · d). Thus,
all connections between dense grid cells can be detected in time O(N · log²(N) · d).
Connected components can be identified in time O(N² · log²(N)). In total, the
complexity is thus
O(N · d + √N · d² + N · d² + N · log²(N) · d + N² · log²(N)) = O(N² · log²(N) · d²).
Note that the number of dense cells p, which is responsible for the N² · log²(N)
term, is usually far lower than N.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Empirical Evaluation</title>
        <p>
          The following experiments have been conducted on a Linux machine with
commodity hardware featuring a 2.0 GHz CPU with two cores and 3.6 GB RAM.
As Grace uses elements from density-based as well as grid-based clustering, we
compare our results to those of DBSCAN and CLIQUE. Whereas Grace works fully
automatically, DBSCAN and CLIQUE both need two parameters, for which
those yielding the best results were chosen. For DBSCAN we used the
scikit-learn Python library implementation and for CLIQUE the implementation from
the data mining framework ELKI [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>
          For a first visual interpretation, we show the effectiveness of Grace on
two simple two-dimensional synthetic data sets containing density-based
clusters and compare the results to those of DBSCAN. Figure 2 shows the
clustering result for "TARGET" [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] with N = 770, consisting of four small clusters
distributed at opposing corners as seen in Figure 2, one of them being detected
as noise (bottom left cluster). Apart from the detected noise cluster, all other
clusters are detected correctly by Grace. With a proper parameter setting,
DBSCAN is capable of detecting all clusters, too. "CLUTO-T8-8K" contains 8000
two-dimensional objects [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Applied to this data set, Grace found some,
but not all clusters, similar to DBSCAN.
        </p>
        <p>Another synthetic data set with N = 101,000 and d = 9, containing three
spherical clusters and 1,000 noise objects, has been created to show the scalability
of Grace. The clusters of this data set are found in a mere 4.5 seconds, whereas,
due to excessive memory consumption, neither DBSCAN nor CLIQUE could
be applied to this data set with the machine used here.</p>
        <p>[Fig. 2: Clustering results for (a) the TARGET dataset and (b) the CLUTO-T8-8K dataset]</p>
        <p>To compare at least
CLIQUE with the proposed approach in high dimensions, another data set with
N = 100,000 and d = 8, also containing three spherical clusters, is clustered by
the two algorithms. Both detect all three clusters, with Grace being almost
three times as fast as CLIQUE (1.7 vs. 4.8 seconds).</p>
        <p>
          Finally, a data set derived from the "VICON" data set containing physical
action data measuring human activity [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] has been tested. The actions are
measured by means of nine sensors on different body parts, each emitting
three-dimensional spatial data. In summary, this yields a data set with 27 dimensions.
For this experiment, the two actions punch and handshake have been merged
into one data set with 5045 objects. Again, neither the DBSCAN implementation
nor the CLIQUE implementation used could be applied to this data set on the
machine used, due to excessive memory consumption. Grace detected an accurate
clustering within 232 ms, with one cluster being detected 100% correctly and the
other cluster being split, with 92% of its points being grouped into one cluster.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Grace finds clusters in multidimensional data spaces where points build a cluster
if they are close in all dimensions. It generates an adaptive grid structure that
makes it possible to reduce the runtime complexity significantly for
multidimensional data spaces compared to similar grid-based approaches. The experimental
evaluation has shown that the algorithm outperforms DBSCAN and CLIQUE
for large data sets and high dimensions. Grace works fully automatically and
can be applied to datasets of various sizes, dimensionalities, and cluster
densities. For clustering high-dimensional data with all dimensions being relevant for
forming the clusters, it is an efficient alternative to established algorithms. It
moreover offers a possibility to get some first insights if no information about the data
is available yet, since no expert knowledge about the data is needed beforehand
due to the absence of any parameters. In future work, noise should be handled
separately, and we are also investigating the suitability for anytime results. The
grid construction is promising for many other applications and could, e.g., be
applied in the context of arbitrarily oriented correlation clusters.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the German Federal Ministry of
Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this
work take full responsibility for its content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Achtert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Bohm,
          <string-name>
            <surname>C.</surname>
          </string-name>
          , Kroger, P.: Deliclu:
          <article-title>Boosting robustness, completeness, usability, and e ciency of hierarchical clustering by a closest pair ranking</article-title>
          .
          <source>In: PAKDD</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic subspace clustering of high dimensional data for data mining applications</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ankerst</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breunig</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
          </string-name>
          , J.: Optics:
          <article-title>Ordering points to identify the clustering structure</article-title>
          .
          <source>In: SIGMOD</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Behzadi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinterhauser</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plant</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Itgc:
          <article-title>Information-theoretic grid-based clustering</article-title>
          .
          <source>In: EDBT</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moulavi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
          </string-name>
          , J.:
          <article-title>Density-based clustering based on hierarchical density estimates</article-title>
          . In: Pei,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Motoda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . (eds.)
          <source>PAKDD</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>D.S.:</given-names>
          </string-name>
          <article-title>A new cell-based clustering method for large, highdimensional data in data mining applications</article-title>
          . In: SAC (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.:</given-names>
          </string-name>
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In: KDD</source>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Goil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagesh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choudhary</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mafia: Efficient and scalable subspace clustering for very large data sets</article-title>
          .
          <source>In: KDD</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hinneburg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Gabriel, H.H.:
          <article-title>Denclue 2.0: Fast clustering based on kernel density estimation</article-title>
          .
          <source>In: IDA</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hinneburg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering</article-title>
          .
          <source>In: VLDB</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hinneburg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>An efficient approach to clustering in multimedia databases with noise</article-title>
          .
          <source>In: KDD</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Karypis</surname>
            , G., Han,
            <given-names>E.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Chameleon: A hierarchical clustering algorithm using dynamic modeling</article-title>
          .
          <source>IEEE Computer</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lichman</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>UCI machine learning repository (</article-title>
          <year>2013</year>
          ), http://archive.ics.uci.edu/ml
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Muntz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Sting: A statistical information grid approach to spatial data mining</article-title>
          .
          <source>In: VLDB</source>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlachos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Scalable density-based clustering with quality guarantees using random projections</article-title>
          .
          <source>DMKD</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Schubert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Elki: A large open-source library for data analysis - Elki release 0.7.5 "Heidelberg"</article-title>
          . arXiv preprint arXiv:1902.03616 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sequeira</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Schism: A new approach for interesting subspace mining</article-title>
          . In: ICDM (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sheikholeslami</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chatterjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , A.:
          <article-title>Wavecluster: A multi-resolution clustering approach for very large spatial databases</article-title>
          .
          <source>In: VLDB</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ultsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Clustering with som, u*c</article-title>
          . In: WSOM (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>