Subspace Clustering with Gravitation. CEUR-WS Vol-581, https://ceur-ws.org/Vol-581/gvd2010_1_1.pdf
                       Subspace clustering with gravitation ∗

                                                           Jiwu Zhao
                                                  Heinrich-Heine University
                                                Institute of Computer Science
                                             Databases and Information Systems
                                                       Universitaetsstr.1
                                               D-40225 Duesseldorf, Germany
                                              zhao@cs.uni-duesseldorf.de

ABSTRACT
Data mining is the process of discovering and exploiting hidden patterns in data. Clustering, an important data mining task, divides the observations into groups (clusters) according to the principle that observations in the same cluster are similar to each other, while observations from different clusters are dissimilar. Subspace clustering enables clustering in subspaces of a data set, which means that clusters can be found not only in the full space but also in subspaces. The well-known subspace clustering methods share a common problem: their parameters are hard to determine. To address this issue, a new bottom-up subspace clustering method is introduced in this article. It uses a gravitation function to select data and dimensions by means of a self-comparison technique. The parameter decision is easy and does not depend on the amount of data, which makes subspace clustering more practical.

∗Copyright is held by the author/owner(s).
GvD Workshop'10, 25.-28.05.2010, Bad Helmstedt, Germany.

Keywords
data mining, subspace clustering, gravitation, high dimensional data

1.   INTRODUCTION
Thanks to modern techniques, data collection is nowadays efficient and cost-effective. The amount of data is huge, and most data is stored in a raw, not yet analyzed form. Usually we need to extract unknown or hidden information from this raw data. Data mining is such a process of discovering and exploiting hidden patterns in data. It involves clustering, classification, regression, association, etc.

Clustering divides the observations into groups (clusters), so that the observations in the same cluster are similar, while the ones from different clusters are dissimilar. Clustering is important for data analysis, such as market basket analysis, bio science, fraud detection and so on.

Subspace clustering enables clustering in subspaces of a data set, which means that clusters can be found in subspaces rather than only in the whole space.

1.1    Related works
In a review of subspace clustering [12], subspace clustering algorithms are categorized into two groups: top-down search and bottom-up search methods. Top-down methods like PROCLUS [1], ORCLUS [2], FINDIT [15], σ-Clusters [16] and COSA [5] use multiple iterations to improve the clustering results. By contrast, bottom-up methods first find clusters in low-dimensional subspaces and then expand the search by adding more dimensions. Some examples are CLIQUE [3], ENCLUS [4], MAFIA [6], CBF [10], CLTree [11] and DOC [13].

In both groups, almost all subspace clustering algorithms share a common problem: finding appropriate values for their parameters. For instance, most top-down methods have to estimate parameters such as the number of clusters and subspaces; the clustering results are improved by iterations based on these parameters, which are by no means easy to estimate. In bottom-up methods, the key parameters such as density, grid interval, size of clusters, etc. are also hard to determine. A method whose parameters are easy to determine is needed in order to make clustering more practical.

DENCLUE [9] is a density-based clustering algorithm that uses a Gaussian kernel function as its abstract density function and a hill-climbing method to find cluster centers. DENCLUE 2.0 [8] is an improvement on DENCLUE. These algorithms differ from other density-based approaches in that they calculate the density at each data point instead of over an area of the attribute space. DENCLUE does not have to estimate the number or the positions of clusters, because clustering is based on the density information of each point. However, it is still necessary to estimate the parameters of these two algorithms, such as σ, ξ in DENCLUE and ε, p in DENCLUE 2.0. Besides, they are not designed for subspace clustering.

Applying Newton's universal law of gravitation to clustering is not a novel idea. A gravitational clustering algorithm [7] simulates the movement of objects under the gravitational force and detects clusters from merged objects. A shrinking-based approach [14] inspired by gravitation is a grid-based clustering method, which shrinks the objects in a grid cell towards the data centroid and then finds the clusters. However, for each of these algorithms we have to find appropriate values for its parameters.

1.2    Contributions of the paper
In this paper, we introduce a new density-based bottom-up subspace clustering method called SUGRA (SUbspace clustering method using the GRAvitation function). The basic idea is similar to DENCLUE, but instead of the Gaussian kernel function we use a gravitation function with a scaled distance as the density function, and the objects do not have to move towards the centroid. From this simple idea we have found an interesting property: in a one-dimensional subspace, almost all cluster objects and non-cluster objects (noise) are separated by a constant. With this property we can detect clusters very distinctly; at the same time, SUGRA reduces the number of parameters needed for subspace clustering.

The remainder of this paper is organized as follows. The idea of SUGRA is introduced in section 2, where sections 2.1 and 2.2 give the definitions of clusters and the gravitation function respectively, and section 2.3 presents the SUGRA algorithm. The last section contains conclusions and areas of future work.

2.   SUBSPACE CLUSTERING WITH GRAVITATION
2.1    Definition of data set and subspace cluster
A data set consists of objects and their attributes. Usually, all objects in a data set have common attributes, such as color, price, length etc., and every object has a value for each attribute. The values related to one attribute lie in the same domain and conform to the same constraints. A data set can be described as a table, where the objects are the rows and the attributes are the columns. The attributes can also be considered as dimensions, so that each attribute represents one dimension and the objects are points in these dimensions.

In order to describe the attributes and objects clearly, they are defined as follows:

   Definition 1. (Data set) Generally, a data set D can be considered as a pair, the combination of A and O:

        D = (A, O)                                                (1)

where A is the set of all attributes (dimensions), and O is the set of all objects:

        A = {a_1, a_2, ..., a_i, ...},   O = {o_1, o_2, ..., o_p, ...}   (2)

where o_p is an object with values on A:

        o_p = {o_p^{a_1}, o_p^{a_2}, ..., o_p^{a_i}, ...}          (3)

We denote the values of all objects on attribute a_i with:

        o^{a_i} = {o_1^{a_i}, o_2^{a_i}, ..., o_p^{a_i}, ...}      (4)

   Definition 2. (Subspace cluster) A subspace cluster S is also a data set and is defined as follows:

        S = D̃ = (Ã, Õ)                                            (5)

where Ã ⊆ A and Õ ⊆ O, and S must satisfy a particular condition C, which is defined differently in every subspace clustering algorithm.

A subspace cluster S can then be written like this:

        S = (Ã, Õ) = ({a_1, a_12, a_60, ...}, {o_1, o_5, o_30, ...})

The cardinalities of S with regard to the objects and the dimensions are defined respectively as:

        |S|_O = |Õ|,   |S|_A = |Ã|                                 (6)

   Remark 1. Suppose S_1, S_2 are two subspace clusters, where S_1 = (A_1, O_1) and S_2 = (A_2, O_2). Then:

   • If A_1 ≠ A_2 ∨ O_1 ≠ O_2, then S_1 ≠ S_2: subspace clusters with different dimensions or objects are considered as different ones.

   • For all A′ ⊆ A_1, S′ = (A′, O_1) is also a subspace cluster.

   • If A_1 ⊆ A_2 ∧ O_1 = O_2, or A_1 = A_2 ∧ O_1 ⊆ O_2, then S_1 < S_2. Only the largest subspace clusters are kept in the clustering result.

   Definition 3. The intersection of two subspace clusters is defined as follows:

        S_1 ∩ S_2 = (A_1 ∪ A_2, O_1 ∩ O_2)                         (7)

   Definition 4. S^{a_i} is the set of all subspace clusters found in dimension a_i, and S^D is the set of all subspace clusters found in D. Finding S^D is the task of subspace clustering.

2.2     Gravitation
Gravitation is a natural phenomenon: the force of attraction between objects with mass. Gravitation is important because it influences our everyday lives.

Newton's law of universal gravitation is defined as follows:

        F = G · (m_1 m_2) / r^2                                    (8)

where F is the gravitational force between two point masses, G is the gravitational constant, m_1 and m_2 are the masses of the two points, and r is the distance between the two point masses.

SUGRA uses the gravitation function as a measurement between objects. In order to simplify the calculation, a simple gravity function is used here:

   Definition 5. (Simple gravity function)

        G_{pq}^{a_i} = (m_p m_q) / r_{pq}^2                        (9)

Suppose that every single object o_p has mass m_p = 1, and that r_{pq}, the distance between o_p and o_q, is defined as r_{pq} = l_{pq} / (L/(N−1)),
Figure 1: Properties of gravitation in one dimension (panels (a) and (b); each plots the objects' gravitation against their positions, with the average gravitation shown as a dotted line)


where L = max(o^{a_i}) − min(o^{a_i}) is the length of this dimension and N = |O| is the number of objects. r_{pq} expresses the real distance l_{pq} as a proportion of the average interval L/(N−1), so that

        G_{pq}^{a_i} = 1 / ( l_{pq} / (L/(N−1)) )^2 = L^2 / ( l_{pq}^2 (N−1)^2 )   (10)

   Remark 2. If r_{pq} = 0, o_p and o_q are at the same position in a_i. In order to keep G_{pq}^{a_i} calculable and to get a logical result, we should set r_{pq} greater than 0 but smaller than any other distance. One idea is to set r_{pq} to half of the minimum distance in a_i. For example, if o_m, o_n have the minimum distance r_{mn} > 0 in a_i, then setting r_{pq} = r_{mn}/2 ensures that G_{pq}^{a_i} > G_{mn}^{a_i}, as expected.

   Remark 3. r_{pq} is defined as a proportional distance rather than a real distance because this maps data with different value ranges into the same range. For example, the attributes age and salary obviously have different ranges, but with such a proportional distance the two attributes can be calculated and compared together.

There are further definitions which are important for the SUGRA algorithm.

   Definition 6. (Gravitation of an object) The gravitation of an object o_p in dimension a_i is defined as the sum of the gravitation between o_p and all other objects:

        G_p^{a_i} = Σ_{∀q, q≠p} G_{pq}^{a_i}                       (11)

   Remark 4. The gravitation of an object defined in (11) has the following properties in one dimension:

   • An object in the middle has a greater gravitation value than one at the edge, which can be seen clearly if the objects are distributed uniformly (see Figure 1 (a)).

   • An object that lies near to others (a cluster object) has a greater gravitation than objects that lie far from others (non-cluster objects) (see Figure 1 (b)).

   Definition 7. (Average gravitation) The average gravitation Ḡ^{a_i} of dimension a_i is the gravitation of the middle object of uniformly distributed objects in a_i. Ḡ^{a_i} is shown as a dotted line in Figure 1.

Suppose o_m is the middle object. For uniformly distributed objects, the distance from o_m to its n-th neighbor on either side is n · L/(N−1), which yields the factor 2 for the two sides, so Ḡ^{a_i} can be calculated as follows:

        Ḡ^{a_i} = Σ_{∀p, p≠m} L^2 / ( l_{mp}^2 (N−1)^2 )
                = L^2/(N−1)^2 · ( Σ_{∀p, p≠m} 1 / l_{mp}^2 )
                = L^2/(N−1)^2 · ( Σ_{1≤n≤N/2} 1 / ((L/(N−1)) · n)^2 ) · 2
                = 2 · ( Σ_{1≤n≤N/2} 1/n^2 )  →  2 · π^2/6 = π^2/3 ≈ 3.29   (N → ∞)

   Remark 5. The gravitation values of cluster objects and non-cluster objects differ greatly. The non-cluster objects usually have very small gravitation values, while the cluster objects have large values. This property is very important for the clustering process.

If a data set has many objects, i.e. N is large, then Ḡ^{a_i} ≈ 3.29. From experiments we found that using the average gravitation Ḡ^{a_i} as a threshold to separate cluster and non-cluster objects returns good results. This is not a silver bullet, but it can be taken as a starting point; the threshold can be adjusted around this value.

Figure 2 shows an example of SUGRA on two-dimensional data. The value 3.29 separates the gravitation values in the two dimensions respectively, where the red points illustrate the gravitation of random points.

Figure 2: SUGRA on two dimensional data (the threshold 3.29 is marked in both dimensions)

2.3     Algorithm of SUGRA
The algorithm consists of two steps:

  1. Data selection (clustering in one-dimensional spaces)

  2. Dimension selection (clustering in high-dimensional spaces)

As a bottom-up algorithm, SUGRA handles the data first in one-dimensional spaces, because one-dimensional data can be dealt with easily. Finding clusters in high-dimensional spaces is based on the clusters found in one dimension.

2.3.1    Data selection

Algorithm 1: Data selection
Input: D = (A, O)
Output: S^{a_i}
 1  foreach a_i ∈ A do
 2      sort o^{a_i}
 3      initialize t = 1
 4      foreach o_p^{a_i} ∈ o^{a_i} do
 5          if G_p^{a_i} > Ḡ^{a_i} then
 6              if o_{p−1}^{a_i} ∈ S_t^{a_i} and |o_p^{a_i} − o_{p−1}^{a_i}| < L/(N−1) then
 7                  add o_p^{a_i} to S_t^{a_i}
 8              else
 9                  t++
10                  add o_p^{a_i} to S_t^{a_i}
11              end
12          end
13      end
14  end

As discussed above, the clusters are first selected in each dimension by means of the gravitation. First of all, the values o^{a_i} are sorted in ascending order. If G_p^{a_i} has a greater value than the threshold Ḡ^{a_i}, o_p^{a_i} is chosen as a cluster candidate. If its neighbor o_{p−1}^{a_i} is also a cluster candidate and their distance is smaller than the average distance, they are considered to belong to one cluster; otherwise o_p^{a_i} starts a new cluster. The process stops when no more new clusters are found in a_i. The process is the same for the other dimensions. Algorithm 1 shows more details of the data selection.

After the data selection process, we have subspace clustering results S^{a_i} for every one-dimensional space; a subspace cluster S_t ∈ S^{a_i} has the form

        S_t = ({a_i}, {o_1, o_5, o_9, ...})                        (12)

2.3.2    Dimension selection

Algorithm 2: Dimension selection
Input: S^{a_i}
Output: S^D
 1  add all S_t ∈ S^{a_i} to S^D
 2  foreach S ∈ S^D do
 3      while find S′ ∈ S^D with S ≠ S′ do
 4          if |S ∩ S′|_O ≥ 2 then
 5              add S ∩ S′ to S^D
 6          end
 7      end
 8  end
 9  return

In the data selection, the one-dimensional clusters were found in the form (12); finding subspace clusters in higher dimensions is based on the intersection defined in (7). For two subspace clusters S_1 and S_2, if |S_1 ∩ S_2|_O ≥ 2, then S_1 ∩ S_2 is a new subspace cluster.

Every combination of clusters is checked through the intersection; this process stops when no new cluster is found. The final result keeps only the largest subspace clusters. The detailed procedure is shown in Algorithm 2.

2.4     Further discussions
Choosing parameters is usually difficult for a subspace clustering algorithm; a small deviation may cause a different result. The boundaries between cluster objects and non-cluster objects are especially indistinct and hard to recognize. SUGRA uses the gravitation function, which marks cluster and non-cluster objects with great differences, and this makes the parameter decision easy.

Data selection. Experimental experience shows that Ḡ^{a_i} separates cluster objects and non-cluster objects very well in the data selection. As derived in Definition 7, Ḡ^{a_i} ≈ 3.29 does not depend on |O| and can be used as the threshold in almost all situations. If the results are not satisfying, the threshold can be set a little smaller or greater.

Dimension selection. The condition |S_1 ∩ S_2|_O ≥ 2 is used in the dimension selection to decide whether S_1 ∩ S_2 is a subspace cluster. The condition means that a group of at least two common objects is taken as a new cluster. This setting has a high precision, because not only big clusters but also small clusters can be found, although it naturally takes more time. In contrast, choosing a greater number may save time but lose some interesting small clusters.
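The two steps can be summarized in code. Below is a minimal, illustrative Python sketch of SUGRA as described in sections 2.2 and 2.3. It is not the authors' implementation: the function names, the toy data representation (clusters as pairs of attribute set and object-index set), the handling of coincident points from Remark 2, and the early discarding of one-object clusters are our own simplifying choices.

```python
import itertools
import math

def gravitation(values, p):
    """Gravitation of object p in one dimension (Definitions 5 and 6)."""
    L = max(values) - min(values)
    avg = L / (len(values) - 1)                       # average interval L/(N-1)
    dists = [abs(values[p] - v) for i, v in enumerate(values) if i != p]
    # Remark 2: replace zero distances by half the minimal positive distance
    min_pos = min((d for d in dists if d > 0), default=avg)
    total = 0.0
    for d in dists:
        r = (d if d > 0 else min_pos / 2) / avg       # scaled distance r_pq
        total += 1.0 / r ** 2                         # masses m_p = m_q = 1
    return total

def data_selection(values):
    """Algorithm 1: one-dimensional clusters as frozensets of object indices."""
    threshold = math.pi ** 2 / 3                      # average gravitation, ~3.29
    avg = (max(values) - min(values)) / (len(values) - 1)
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters, current, prev_i = [], [], None
    for i in order:
        if gravitation(values, i) > threshold:        # cluster candidate
            if prev_i in current and values[i] - values[prev_i] < avg:
                current.append(i)                     # same cluster as predecessor
            else:
                if len(current) >= 2:                 # drop one-object "clusters"
                    clusters.append(frozenset(current))
                current = [i]                         # start a new cluster
        prev_i = i
    if len(current) >= 2:
        clusters.append(frozenset(current))
    return clusters

def dimension_selection(one_dim):
    """Algorithm 2: intersect clusters across dimensions (Definition 3)."""
    found = {(frozenset([a]), c) for a, cs in one_dim.items() for c in cs}
    changed = True
    while changed:                                    # repeat until no new cluster
        changed = False
        for (a1, o1), (a2, o2) in itertools.combinations(list(found), 2):
            cand = (a1 | a2, o1 & o2)
            if len(cand[1]) >= 2 and cand not in found:
                found.add(cand)
                changed = True
    # keep only the largest subspace clusters (Remark 1)
    return {s for s in found
            if not any(t != s and s[0] <= t[0] and s[1] <= t[1] for t in found)}
```

The only tuning knob is the threshold π²/3 ≈ 3.29, mirroring section 2.4: if the results are unsatisfying, it can be nudged slightly up or down.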
2.4.1    Run time
The run time of the data selection in one dimension a_i is |O|^2, and for the whole data set D = (A, O) it is |A| · |O|^2.

In the dimension selection, every possible combination of subspace clusters may be examined, so the maximum run time of the dimension selection is 2^m, where m is the number of one-dimensional subspace clusters found in the data selection.

3.   CONCLUSIONS
Subspace clustering is able to discover clusters and extract their features from subspaces of high-dimensional data, which is commonly gathered in many fields. Most familiar subspace clustering approaches have problems with determining their parameters. We apply the gravitation function to subspace clustering in order to obtain a new method that makes the determination of the parameters easier. The method is named SUGRA and belongs to the bottom-up algorithms. First, it finds clusters in one-dimensional spaces using the gravitation function; then it combines these clusters in higher dimensions to search for high-dimensional subspace clusters.

In SUGRA, the non-cluster objects always have low gravitation values (< 3.29), while the cluster objects have very large values, which depend on the clusters' density and number of objects. The value 3.29 does not depend on the number of objects, so it can separate out the non-cluster objects in order to obtain the cluster objects. We do not have to choose parameters like other algorithms; with this threshold, SUGRA obtains an almost satisfying result from the start.

Future research will focus on optimizing the gravitation function and the algorithm in order to improve the subspace clustering results. The gravitation technique should be used not only on one-dimensional data but also directly in multiple dimensions. Another task is to adapt SUGRA to various data types, such as categorical data.

4.   REFERENCES
 [1] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc,
     August 15-18 1999.
 [5] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2002.
 [6] S. Goil, H. Nagesh, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University, June 1999.
 [7] J. Gomez, D. Dasgupta, and O. Nasraoui. A new gravitational clustering algorithm. In Proc. of the SIAM Int. Conf. on Data Mining (SDM), 2003.
 [8] A. Hinneburg and H.-H. Gabriel. DENCLUE 2.0: Fast clustering based on kernel density estimation. In Proc. of the International Symposium on Intelligent Data Analysis 2007 (IDA'07), Ljubljana, Slovenia, 2007. LNAI, Springer.
 [9] A. Hinneburg and D. A. Keim. An efficient approach to clustering in multimedia databases with noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, 1998. AAAI Press.
[10] J.-W. Chang and D.-S. Jin. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 11-14, Madrid, Spain, March 2002.
[11] B. Liu, Y. Xia, and P. S. Yu. Clustering through decision tree construction. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 20-29, McLean, Virginia, United States, November 6-11, 2000.
[12] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6:90-105, June 2004.
[13] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002.
[14] Y. Shi, Y. Song, and A. Zhang. A shrinking-based approach for multi-dimensional data analysis. In
     and J. S. Park. Fast algorithms for projected                     VLDB ’2003: Proceedings of the 29th international
     clustering. In Proceedings of the 1999 ACM SIGMOD                 conference on Very large data bases, pages 440–451.
     international conference on Management of data,                   VLDB Endowment, 2003.
     pages 61–72, Philadelphia, Pennsylvania, United              [15] K.-G. Woo and J.-H. Lee. FINDIT: a Fast and
     States, May 31-June 03 1999.                                      Intelligent Subspace Clustering Algorithm using
 [2] C. C. Aggarwal and P. S. Yu. Finding generalized                  Dimension Voting. PhD thesis, Korea Advanced
     projected clusters in high dimensional spaces. In                 Institute of Science and Technology, Taejon, Korea,
     Proceedings of the 2000 ACM SIGMOD international                  2002.
     conference on Management of data, pages 70–81,               [16] J. Yang, W. Wang, H. Wang, and P. Yu. δ-clusters:
     Dallas, Texas, United States, May 15-18 2000.                     Capturing subspace correlation in a large data set. In
 [3] R. Agrawal, J. Gehrke, D. Gunopulos, and                          Proceedings of the 18th International Conference on
     P. Raghavan. Automatic subspace clustering of high                Data Engineering, page 517, February 26-March 01
     dimensional data for data mining applications. In                 2002.
     Proceedings of the 1998 ACM SIGMOD international             [17] J. Zhao. Automatische Parameterbestimmung durch
     conference on Management of data, pages 94–105,                   Gravitation in Subspace Clustering. 21. Workshop
     Seattle, Washington, United States, June 01-04 1998.              “Grundlagen von Datenbanken”, Rostock, 2009.
 [4] C.-H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based
     subspace clustering for mining numerical data. In
     Proceedings of the fifth ACM SIGKDD international
     conference on Knowledge discovery and data mining,
     pages 84–93, San Diego, California, United States,