=Paper=
{{Paper
|id=None
|storemode=property
|title=Subspace Clustering with Gravitation
|pdfUrl=https://ceur-ws.org/Vol-581/gvd2010_1_1.pdf
|volume=Vol-581
|dblpUrl=https://dblp.org/rec/conf/gvd/Zhao10
}}
==Subspace Clustering with Gravitation==
Jiwu Zhao
Heinrich-Heine University
Institute of Computer Science
Databases and Information Systems
Universitaetsstr. 1
D-40225 Duesseldorf, Germany
zhao@cs.uni-duesseldorf.de

Copyright is held by the author/owner(s). GvD Workshop'10, 25.-28.05.2010, Bad Helmstedt, Germany.
ABSTRACT
Data mining is the process of discovering and exploiting hidden patterns in data. Clustering, an important data mining task, divides observations into groups (clusters) according to the principle that observations in the same cluster are similar to each other, while observations from different clusters are dissimilar. Subspace clustering enables clustering within subspaces of a data set, which means that clusters can be found not only in the whole space but also in its subspaces. The well-known subspace clustering methods share a common problem: their parameters are hard to determine. To address this issue, a new bottom-up subspace clustering method is introduced in this article. It uses a gravitation function to select data and dimensions by means of a self-comparison technique. The parameter decision is easy and does not depend on the amount of data, which makes subspace clustering more practical.

Keywords
data mining, subspace clustering, gravitation, high dimensional data

1. INTRODUCTION
Thanks to modern techniques, data collection is nowadays efficient and cost-effective. The amount of data is huge, and most of it is stored in a raw, not yet analyzed form. Usually we need to extract unknown or hidden information from such raw data. Data mining is the process of discovering and exploiting hidden patterns in data. It involves clustering, classification, regression, association, etc.

Clustering divides the observations into groups (clusters), so that observations in the same cluster are similar, while observations from different clusters are dissimilar. Clustering is important for data analysis in fields such as market basket analysis, bio science, fraud detection and so on.

Subspace clustering enables clustering in subspaces within a data set, which means that clusters can be found in subspaces rather than only in the whole space.

1.1 Related works
In a review of subspace clustering [12], subspace clustering algorithms are categorized into two groups: top-down search and bottom-up search methods. Top-down methods like PROCLUS [1], ORCLUS [2], FINDIT [15], δ-Clusters [16], and COSA [5] use multiple iterations to improve the clustering results. By contrast, bottom-up methods first find clusters in lower-dimensional subspaces and then expand the search by adding more dimensions. Examples are CLIQUE [3], ENCLUS [4], MAFIA [6], CBF [10], CLTree [11], and DOC [13].

In both groups, almost all subspace clustering algorithms share a common problem: finding appropriate values for their parameters. For instance, most top-down methods have to estimate parameters like the number of clusters and subspaces; the clustering results are improved by iterations based on these parameters, which are by no means easy to estimate. In bottom-up methods, the key parameters such as density, grid interval, size of clusters, etc. are also hard to determine. A method whose parameters can be determined easily is needed in order to make clustering more practical.

DENCLUE [9] is a density-based clustering algorithm that uses a Gaussian kernel function as its abstract density function and a hill-climbing method to find cluster centers. DENCLUE 2.0 [8] is an improvement on DENCLUE. These algorithms differ from other density-based approaches in that they calculate a density for each data point instead of for an area in the attribute space. DENCLUE does not have to estimate the number or the position of clusters, because clustering is based on the density information of each point. However, it is still necessary to estimate the parameters of these two algorithms, such as σ and ξ in DENCLUE and ε and p in DENCLUE 2.0. Besides, they are not designed for subspace clustering.
Applying Newton's universal law of gravitation in clustering is not a novel idea. A gravitational clustering algorithm [7] simulates the movement of objects under the gravitational force and detects clusters from merged objects. A shrinking-based approach [14] inspired by gravitation is a grid-based clustering method, which shrinks the objects in a grid cell towards the data centroid and finds the clusters there. However, for each of these algorithms we still have to find appropriate values for its parameters.

1.2 Contributions of the paper
In this paper, we introduce a new density-based bottom-up subspace clustering method called SUGRA (SUbspace clustering method by using GRAvitation's function). The basic idea is similar to DENCLUE, but instead of the Gaussian kernel function we use a gravitation function with scaled distance as the density function, and the objects do not have to move towards the centroid. From this simple idea we have found an interesting property: in a one dimensional subspace, almost all cluster objects and non-cluster objects (noise) are separated by a constant. With this property we can detect clusters very distinctly; at the same time, SUGRA reduces the number of parameters needed for subspace clustering.

The remainder of this paper is organized as follows. The idea of SUGRA is introduced in section 2, where sections 2.1 and 2.2 define the notions of subspace cluster and gravitation function respectively, and section 2.3 presents the algorithm of SUGRA. The last section contains conclusions and areas of future work.

2. SUBSPACE CLUSTERING WITH GRAVITATION
2.1 Definition of data set and subspace cluster
A data set consists of objects and their attributes. Usually, all objects in a data set have common attributes, such as color, price, length etc., and every object has a value for each attribute. The values related to one attribute are in the same domain and conform to the same constraints. A data set can be described as a table, where the objects are rows and the attributes are columns. The attributes can also be considered as dimensions, so that each attribute represents one dimension and the objects are points in these dimensions.

In order to describe the attributes and objects clearly, they are defined as follows:

Definition 1. (Data set) Generally, a data set D can be considered as a pair, the combination of A and O:

$D = (A, O)$   (1)

where A is the set of all attributes (dimensions) and O is the set of all objects:

$A = \{a_1, a_2, \ldots, a_i, \ldots\}, \quad O = \{o_1, o_2, \ldots, o_p, \ldots\}$   (2)

where $o_p$ is an object with values on A:

$o_p = \{o_p^{a_1}, o_p^{a_2}, \ldots, o_p^{a_i}, \ldots\}$   (3)

We denote the values of all objects on attribute $a_i$ with:

$o^{a_i} = \{o_1^{a_i}, o_2^{a_i}, \ldots, o_p^{a_i}, \ldots\}$   (4)

Definition 2. (Subspace cluster) A subspace cluster S is also a data set and is defined as follows:

$S = \tilde{D} = (\tilde{A}, \tilde{O})$   (5)

where $\tilde{A} \subseteq A$ and $\tilde{O} \subseteq O$, and S must satisfy a particular condition C, which is defined differently in every subspace clustering algorithm.

A subspace cluster S can then be written like this:

$S = (\tilde{A}, \tilde{O}) = (\{a_1, a_{12}, a_{60}, \ldots\}, \{o_1, o_5, o_{30}, \ldots\})$

The cardinalities regarding the objects and the dimensions in S are defined respectively as:

$|S|_O = |\tilde{O}|, \quad |S|_A = |\tilde{A}|$   (6)

Remark 1. Suppose $S_1, S_2$ are two subspace clusters, where $S_1 = (A_1, O_1)$ and $S_2 = (A_2, O_2)$. Then:

• If $A_1 \neq A_2 \vee O_1 \neq O_2 \Longrightarrow S_1 \neq S_2$: subspace clusters with different dimensions or different objects are considered different ones.

• $\forall A' \subseteq A_1$, $S' = (A', O_1)$ is also a subspace cluster.

• If $A_1 \subseteq A_2 \wedge O_1 = O_2$ or $A_1 = A_2 \wedge O_1 \subseteq O_2 \Longrightarrow S_1 < S_2$. Only the largest subspace cluster will be kept in the clustering result.

Definition 3. The intersection of two subspace clusters is defined as follows:

$S_1 \cap S_2 = (A_1 \cup A_2, O_1 \cap O_2)$   (7)

Definition 4. $S^{a_i}$ is the set of all subspace clusters found in dimension $a_i$, and $S^D$ is the set of all subspace clusters found in D. Finding $S^D$ is the task of subspace clustering.
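To make these definitions concrete, here is a small Python sketch; it is our illustration, not part of the paper, and all names (`Cluster`, `intersect`, etc.) are ours. It represents a subspace cluster as a pair of frozensets and implements the cardinalities of (6) and the intersection of Definition 3.

```python
from typing import FrozenSet, Tuple

# A subspace cluster S = (A~, O~): dimension indices and object ids.
Cluster = Tuple[FrozenSet[int], FrozenSet[int]]

def card_a(s: Cluster) -> int:
    """|S|_A from Eq. (6): number of dimensions of the cluster."""
    return len(s[0])

def card_o(s: Cluster) -> int:
    """|S|_O from Eq. (6): number of objects in the cluster."""
    return len(s[1])

def intersect(s1: Cluster, s2: Cluster) -> Cluster:
    """Definition 3 / Eq. (7): union of dimensions, intersection of objects."""
    return (s1[0] | s2[0], s1[1] & s2[1])

# Two one dimensional clusters sharing three objects:
s1: Cluster = (frozenset({0}), frozenset({1, 5, 9, 12}))
s2: Cluster = (frozenset({1}), frozenset({5, 9, 12, 20}))
print(intersect(s1, s2))   # dimensions {0, 1}, objects {5, 9, 12}
```

Note how the intersection of two one dimensional clusters yields a candidate for a two dimensional cluster; this is exactly the mechanism the dimension selection step builds on.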
2.2 Gravitation
Gravitation is a natural phenomenon which describes the force of attraction between objects with mass. Newton's law of universal gravitation is defined as follows:

$F = G \cdot \frac{m_1 m_2}{r^2}$   (8)

where F is the gravitational force between two point masses, G is the gravitational constant, $m_1$ and $m_2$ are the masses of the two points, and r is the distance between them.

SUGRA uses the gravitation function as the measure between objects. In order to simplify the calculation, a simple gravity function is used here:

Definition 5. (Simple gravity function)

$G_{pq}^{a_i} = \frac{m_p m_q}{r_{pq}^2}$   (9)

Every single object $o_p$ is assumed to have the mass $m_p = 1$. The distance $r_{pq}$ between $o_p$ and $o_q$ is defined as the scaled distance

$r_{pq} = \frac{l_{pq}}{L/(N-1)}$

where $L = \max(o^{a_i}) - \min(o^{a_i})$ is the length of the dimension and $N = |O|$ is the number of objects. $r_{pq}$ is thus the ratio of the real distance $l_{pq}$ to the average interval $L/(N-1)$, so that

$G_{pq}^{a_i} = \frac{1}{\bigl(l_{pq}/(L/(N-1))\bigr)^2} = \frac{L^2}{l_{pq}^2 (N-1)^2}$   (10)

Remark 2. If $r_{pq} = 0$, then $o_p$ and $o_q$ occupy the same position in $a_i$. To keep $G_{pq}^{a_i}$ computable and to get a logical result, we set $r_{pq}$ greater than 0 but smaller than any other distance. One way is to set $r_{pq}$ to half of the minimum positive distance in $a_i$: if $o_m, o_n$ have the minimum distance $r_{mn} > 0$ in $a_i$, then setting $r_{pq} = r_{mn}/2$ ensures that $G_{pq}^{a_i} > G_{mn}^{a_i}$, as expected.

Remark 3. $r_{pq}$ is defined as a proportional distance rather than a real distance because this maps data with different value ranges into the same range. For example, the attributes age and salary obviously have very different ranges, but with such a proportional distance the two attributes can be calculated and compared together.

There are further definitions which are important for the SUGRA algorithm.

Definition 6. (Gravitation of an object) The gravitation of an object $o_p$ in dimension $a_i$ is defined as the sum of the gravitation between $o_p$ and all other objects:

$G_p^{a_i} = \sum_{q \neq p} G_{pq}^{a_i}$   (11)

Remark 4. The gravitation of an object as defined in (11) has the following properties in one dimension:

• An object in the middle has a greater gravitation than one at the edge, which can be seen clearly when the objects are distributed uniformly (see Figure 1 (a)).

• An object that lies near to others (a cluster object) has a greater gravitation than objects far from all others (non-cluster objects) (see Figure 1 (b)).

[Figure 1: Properties of gravitation in one dimension. Both panels plot the gravitation of each object (solid line) and the average gravitation (dotted line) over the object positions: (a) uniformly distributed objects, (b) objects containing clusters.]
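Definitions 5 and 6 together with Remark 2 translate directly into code. The following Python sketch is our illustration rather than the paper's implementation; the function name `gravitation_1d` is ours. It computes $G_p^{a_i}$ for every object of one dimension with the scaled distance, applying the half-minimum-distance rule of Remark 2 to coinciding objects.

```python
import numpy as np

def gravitation_1d(values):
    """Gravitation G_p (Eq. 11) of every object in one dimension,
    using the scaled distance of Definition 5 and Eq. (10).
    Assumes at least two distinct values in `values`."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    length = x.max() - x.min()                 # L, the length of the dimension
    avg_interval = length / (n - 1)            # L / (N - 1)
    dist = np.abs(x[:, None] - x[None, :])     # real pairwise distances l_pq
    half_min = dist[dist > 0].min() / 2
    dist[dist == 0] = half_min                 # Remark 2: half the minimum distance
    r = dist / avg_interval                    # scaled distances r_pq
    g = 1.0 / r**2                             # pairwise gravitation, Eq. (9), masses = 1
    np.fill_diagonal(g, 0.0)                   # exclude q = p from the sum in Eq. (11)
    return g.sum(axis=1)
```

On uniformly spaced values the result resembles Figure 1 (a); on data containing a dense group, the group's objects stand out as in Figure 1 (b).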
Definition 7. (Average gravitation) The average gravitation $\bar{G}^{a_i}$ of dimension $a_i$ is the gravitation of the middle object when the objects are distributed uniformly in $a_i$. $\bar{G}^{a_i}$ is drawn as a dotted line in Figure 1.

Suppose $o_m$ is the object in the middle. For uniformly distributed objects, the distance from $o_m$ to its n-th neighbour on either side is $\frac{L}{N-1} \cdot n$, so the sum runs over $1 \leq n \leq N/2$ on each side and is doubled. $\bar{G}^{a_i}$ can thus be calculated as follows:

$\bar{G}^{a_i} = \sum_{p \neq m} \frac{L^2}{l_{mp}^2 (N-1)^2} = \frac{L^2}{(N-1)^2} \cdot \left( \sum_{p \neq m} \frac{1}{l_{mp}^2} \right) = \frac{L^2}{(N-1)^2} \cdot \left( \sum_{1 \leq n \leq N/2} \frac{1}{\bigl(\frac{L}{N-1} \cdot n\bigr)^2} \right) \cdot 2 = 2 \cdot \sum_{1 \leq n \leq N/2} \frac{1}{n^2} \xrightarrow{N \to \infty} 2 \cdot \frac{\pi^2}{6} = \frac{\pi^2}{3} \approx 3.29$
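The limit can also be checked numerically. This quick sanity check is ours, not part of the paper; it prints the partial sums of $2 \sum 1/n^2$ for a growing number of objects N.

```python
import math

# Partial sums of 2 * sum_{n=1..N/2} 1/n^2, which by Definition 7
# converge to pi^2 / 3 as the number of objects N grows.
for n_objects in (10, 100, 1000):
    g_avg = 2 * sum(1.0 / n**2 for n in range(1, n_objects // 2 + 1))
    print(n_objects, round(g_avg, 4))
print(round(math.pi**2 / 3, 4))   # the limit, approximately 3.2899
```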
Remark 5. The gravitation values of cluster objects and non-cluster objects differ greatly. Non-cluster objects usually have very small gravitation values, while cluster objects have much larger ones. This property is very important for the clustering process.

If a data set has many objects, i.e. N is a big number, then $\bar{G}^{a_i} \approx 3.29$. From experiments we have found that using the average gravitation $\bar{G}^{a_i}$ as a threshold to separate cluster objects from non-cluster objects returns good results. This is not a silver bullet, but it can be taken as a starting point; the threshold can be adjusted around this value.

Figure 2 shows an example of SUGRA on two dimensional data. The value 3.29 separates the gravitation in each of the two dimensions respectively; the red points illustrate the gravitation values of the points.

[Figure 2: SUGRA on two dimensional data. The gravitation along each of the two dimensions is plotted next to the data, with the threshold value 3.29 marked in both dimensions.]

2.3 Algorithm of SUGRA
The algorithm consists of two steps:

1. Data selection (clustering in one dimensional spaces)

2. Dimension selection (clustering in high dimensional spaces)

As a bottom-up algorithm, SUGRA first handles the data in one dimensional spaces, because one dimensional data can be dealt with easily. Finding clusters in high dimensional spaces is then based on the clusters found in one dimension.

2.3.1 Data selection
The clusters are first selected in each dimension by means of the gravitation. The values $o^{a_i}$ are sorted in ascending order. If $G_p^{a_i}$ is greater than the threshold $\bar{G}^{a_i}$, then $o_p^{a_i}$ is chosen as a cluster candidate. If its neighbour $o_{p-1}^{a_i}$ is also a cluster candidate and their distance is smaller than the average interval, the two are considered to belong to one cluster; otherwise $o_p^{a_i}$ starts a new cluster. The process stops when no more new clusters are found in $a_i$, and is repeated in the same way for the other dimensions. Algorithm 1 shows the data selection in detail; one possible concrete reading follows the pseudocode.

Algorithm 1: Data selection
Input: D = (A, O)
Output: $S^{a_i}$
 1  foreach $a_i \in A$ do
 2      sort $o^{a_i}$
 3      initialize t = 1
 4      foreach $o_p^{a_i} \in o^{a_i}$ do
 5          if $G_p^{a_i} > \bar{G}^{a_i}$ then
 6              if $o_{p-1}^{a_i} \in S_t^{a_i}$ and $|o_p^{a_i} - o_{p-1}^{a_i}| < L/(N-1)$ then
 7                  add $o_p^{a_i}$ to $S_t^{a_i}$
 8              else
 9                  t++
10                  add $o_p^{a_i}$ to $S_t^{a_i}$
11              end
12          end
13      end
14  end
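Below is one possible Python rendering of Algorithm 1 for a single dimension, reusing `gravitation_1d` from the earlier sketch. It reflects our reading of the pseudocode, not the author's code; in particular, we interpret the distance test in line 6 as a comparison against the average interval L/(N−1), as the prose suggests.

```python
import numpy as np

def data_selection_1d(values, threshold=np.pi**2 / 3):
    """Sketch of Algorithm 1 for one dimension: objects whose gravitation
    exceeds the threshold (~3.29) become cluster candidates; a candidate
    joins the open cluster if its predecessor in sorted order belongs to it
    and lies closer than the average interval."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(x)                                # sort o^{a_i}
    g = gravitation_1d(x)                                # G_p^{a_i} per object
    avg_interval = (x.max() - x.min()) / (len(x) - 1)    # L / (N - 1)

    clusters = []      # finished clusters, each a list of object indices
    current = None     # the cluster S_t currently being built
    prev = None        # previous object in sorted order
    for idx in order:
        if g[idx] > threshold:
            if current is not None and prev in current \
                    and x[idx] - x[prev] < avg_interval:
                current.append(idx)                      # line 7: extend S_t
            else:
                if current is not None:
                    clusters.append(current)
                current = [idx]                          # lines 9-10: t++, new S_t
        prev = idx
    if current is not None:
        clusters.append(current)
    return clusters
```

Singleton clusters may remain in the output; they are filtered out later by the $|S_1 \cap S_2|_O \geq 2$ condition of the dimension selection.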
After the data selection process we get subspace clustering results in every one dimensional space $S^{a_i}$; a subspace cluster $S_t \in S^{a_i}$ has the form

$S_t = (\{a_i\}, \{o_1, o_5, o_9, \ldots\})$   (12)

2.3.2 Dimension selection
In data selection, the one dimensional clusters were found in the form of (12). Finding subspace clusters in higher dimensions is based on the intersection defined in (7): for subspace clusters $S_1$ and $S_2$, if $|S_1 \cap S_2|_O \geq 2$, then $S_1 \cap S_2$ is a new subspace cluster.

Every combination of clusters is checked through the intersection; this process stops when no more new clusters are found. The final result keeps only the largest subspace clusters. The detailed procedure is shown in Algorithm 2.

Algorithm 2: Dimension selection
Input: $S^{a_i}$
Output: $S^D$
1  add all $S_t \in S^{a_i}$ to $S^D$
2  foreach $S \in S^D$ do
3      while find $S' \in S^D$ with $S \neq S'$ do
4          if $|S \cap S'|_O \geq 2$ then
5              add $S \cap S'$ to $S^D$
6          end
7      end
8  end
9  return $S^D$
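A minimal Python sketch of Algorithm 2, using the same cluster representation as the earlier sketches (pairs of frozensets); again this is our reading rather than the author's code. The maximality filter at the end is a slight generalization of the ordering in Remark 1.

```python
def dimension_selection(one_dim_clusters):
    """Sketch of Algorithm 2. Input: one dimensional clusters as pairs of
    frozensets (dimensions, objects). The set is closed under the
    intersection of Eq. (7); only maximal clusters are returned."""
    found = set(one_dim_clusters)
    grew = True
    while grew:                          # repeat until no new cluster appears
        grew = False
        for s1 in list(found):
            for s2 in list(found):
                if s1 == s2:
                    continue
                cand = (s1[0] | s2[0], s1[1] & s2[1])    # Eq. (7)
                if len(cand[1]) >= 2 and cand not in found:
                    found.add(cand)
                    grew = True
    # Keep only the largest clusters: drop any cluster whose dimensions
    # and objects are both contained in another cluster's (cf. Remark 1).
    return [s for s in found
            if not any(s != t and s[0] <= t[0] and s[1] <= t[1] for t in found)]
```

The input would typically be the per-dimension results of `data_selection_1d`, converted to `(frozenset({i}), frozenset(indices))` pairs.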
2.4 Further discussions
Choosing parameters is usually difficult for a subspace clustering algorithm; a small deviation may cause a different result. The boundaries between cluster objects and non-cluster objects are especially indistinct and can hardly be recognized. SUGRA uses the gravitation function, which marks cluster and non-cluster objects with great differences, and this makes the parameter decision easy.

Data selection. Experimental experience shows that $\bar{G}^{a_i}$ separates the cluster objects and non-cluster objects very well in data selection. As derived in Definition 7, $\bar{G}^{a_i} \approx 3.29$ does not depend on |O| and can be used as the threshold in almost all situations. If the results are not satisfying, the threshold can be set a little smaller or greater.

Dimension selection. The condition $|S_1 \cap S_2|_O \geq 2$ is used in dimension selection to decide whether $S_1 \cap S_2$ is a subspace cluster. The condition means that every object group with at least two objects is taken as a new cluster. This setting has a high precision, because not only big clusters but also small clusters can be found, but it naturally takes much time. Choosing a greater number instead saves time but may lose some interesting small clusters.
2.4.1 Run time
The run time of the data selection in one dimension $a_i$ is $O(|O|^2)$, and for the whole data set D = (A, O) it is $O(|A| \cdot |O|^2)$.

In dimension selection, every possible combination of subspace clusters may be examined, so the maximum run time of the dimension selection is $O(2^m)$, where m is the number of one dimensional subspace clusters found in data selection.
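Putting the sketches together on a small synthetic two dimensional data set (our example, loosely mirroring the situation of Figure 2) shows the intended end-to-end flow:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two dimensional toy data: one dense cluster plus uniform background noise.
dense = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(30, 2))
noise = rng.uniform(0.0, 20.0, size=(70, 2))
data = np.vstack([dense, noise])

# Data selection per dimension, then dimension selection on the results.
one_dim = []
for i in range(data.shape[1]):
    for members in data_selection_1d(data[:, i]):
        one_dim.append((frozenset({i}), frozenset(int(m) for m in members)))

for dims, objs in dimension_selection(one_dim):
    print(sorted(dims), len(objs))   # the 2-D cluster of ~30 objects should appear
```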
3. CONCLUSIONS
Subspace clustering is able to discover clusters and extract their features from the subspaces of high dimensional data, which is commonly gathered in many fields. Most familiar subspace clustering approaches have problems with determining their parameters. We have attempted to apply the gravitation function in subspace clustering in order to obtain a method that makes the determination of the parameters easier. The method, named SUGRA, belongs to the bottom-up algorithms. First, it finds clusters in one dimensional spaces by using the gravitation function; then it combines these clusters to search for high dimensional subspace clusters.

In SUGRA, the non-cluster objects always have low gravitation values (< 3.29), while the cluster objects have very large values, which depend on the clusters' density and number of objects. The value 3.29 does not depend on the number of objects, so it can be used to separate the non-cluster objects from the cluster objects. We do not have to choose parameters as in other algorithms; using this threshold, SUGRA obtains an almost satisfying result from the start.

Future research will focus on optimizing the gravitation function and the algorithm in order to improve the subspace clustering results. The gravitation technique should be applied not only to one dimensional data but also directly to multiple dimensions. Another task is to adapt SUGRA to various data types, such as categorical data.
4. REFERENCES
[1] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. Fast algorithms for projected clustering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 61-72, Philadelphia, Pennsylvania, United States, May 31-June 3 1999.
[2] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 70-81, Dallas, Texas, United States, May 15-18 2000.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 94-105, Seattle, Washington, United States, June 1-4 1998.
[4] C.-H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 84-93, San Diego, California, United States, August 15-18 1999.
[5] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2002.
[6] S. Goil, H. Nagesh, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University, June 1999.
[7] J. Gomez, D. Dasgupta, and O. Nasraoui. A new gravitational clustering algorithm. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2003.
[8] A. Hinneburg and H.-H. Gabriel. DENCLUE 2.0: Fast clustering based on kernel density estimation. In Proceedings of the International Symposium on Intelligent Data Analysis 2007 (IDA'07), Ljubljana, Slovenia, 2007. LNAI, Springer.
[9] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, 1998. AAAI Press.
[10] J.-W. Chang and D.-S. Jin. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 11-14, Madrid, Spain, March 2002.
[11] B. Liu, Y. Xia, and P. S. Yu. Clustering through decision tree construction. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 20-29, McLean, Virginia, United States, November 6-11 2000.
[12] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6:90-105, June 2004.
[13] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6 2002.
[14] Y. Shi, Y. Song, and A. Zhang. A shrinking-based approach for multi-dimensional data analysis. In VLDB 2003: Proceedings of the 29th International Conference on Very Large Data Bases, pages 440-451. VLDB Endowment, 2003.
[15] K.-G. Woo and J.-H. Lee. FINDIT: A Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting. PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.
[16] J. Yang, W. Wang, H. Wang, and P. Yu. δ-Clusters: Capturing subspace correlation in a large data set. In Proceedings of the 18th International Conference on Data Engineering, page 517, February 26-March 1 2002.
[17] J. Zhao. Automatische Parameterbestimmung durch Gravitation in Subspace Clustering. 21. Workshop "Grundlagen von Datenbanken", Rostock, 2009.