<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subspace clustering with gravitation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiwu Zhao</string-name>
          <email>zhao@cs.uni-duesseldorf.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Heinrich-Heine University, Institute of Computer Science, Databases and Information Systems</institution>
          <addr-line>Universitaetsstr. 1, D-40225 Duesseldorf</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>Data mining is the process of discovering and exploiting hidden patterns in data. Clustering, an important data mining task, divides observations into groups (clusters) according to the principle that observations in the same cluster are similar and observations from different clusters are dissimilar. Subspace clustering enables clustering in subspaces of a data set, so clusters can be found not only in the full space but also in subspaces. The well-known subspace clustering methods share a common problem: their parameters are hard to determine. To address this issue, a new bottom-up subspace clustering method is introduced in this article. It uses a gravitation function to select data and dimensions through a self-comparison technique. The parameter decision is easy and does not depend on the amount of data, which makes subspace clustering more practical.</p>
      </abstract>
      <kwd-group>
        <kwd>data mining</kwd>
        <kwd>subspace clustering</kwd>
        <kwd>gravitation</kwd>
        <kwd>high dimensional data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.1 Related work</title>
      <p>DENCLUE [9] is a density-based clustering algorithm that uses a Gaussian kernel as its density function and hill climbing to find cluster centers. DENCLUE 2.0 [8] is an improvement on DENCLUE. These algorithms differ from other density-based approaches in that they calculate the density at each data point instead of over an area of the attribute space. DENCLUE does not have to estimate the number or the position of clusters, because the clustering is based on the density information of each point. However, it is still necessary to estimate the parameters of these two algorithms, such as σ and ξ in DENCLUE and h, p in DENCLUE 2.0. Besides, they are not designed for subspace clustering. Applying Newton's universal law of gravitation to clustering is not a novel idea. A gravitational clustering algorithm [7] simulates the movement of objects under gravitational force and detects clusters from merged objects. A shrinking-based approach [14] inspired by gravitation is a grid-based clustering method that shrinks the objects in a grid cell towards the data centroid and thereby finds the clusters. However, for each of these algorithms we have to find appropriate values for its parameters.</p>
    </sec>
    <sec id="sec-2">
      <title>1.2 Contributions of the paper</title>
      <p>In this paper, we introduce a new density-based bottom-up subspace clustering method called SUGRA (SUbspace clustering with a GRAvitation function). The basic idea is similar to DENCLUE, but instead of the Gaussian kernel we use a gravitation function with scaled distances to represent the density function, and the objects do not have to move towards the centroid. From this simple idea we found an interesting property: in a one-dimensional subspace, almost all cluster objects and non-cluster objects (noise) are separated by a constant. With this property we can detect clusters very distinctly; at the same time, SUGRA reduces the number of parameters needed for subspace clustering.</p>
      <p>The remainder of this paper is organized as follows. The idea of SUGRA is introduced in Section 2: Sections 2.1 and 2.2 give the definitions of clusters and of the gravitation function respectively, and Section 2.3 presents the SUGRA algorithm. The last section contains conclusions and areas of future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. SUBSPACE CLUSTERING WITH GRAVITATION</title>
    </sec>
    <sec id="sec-5">
      <title>2.1 Definition of data set and subspace cluster</title>
      <p>A data set consists of objects and their attributes. Usually all objects in a data set have common attributes, such as color, price, length, etc., and every object has a value for each attribute. The values related to one attribute lie in the same domain and conform to the same constraints. A data set can be described as a table, where the objects are the rows and the attributes are the columns. The attributes can also be considered as dimensions, so that each attribute represents one dimension and the objects are points in these dimensions.</p>
      <p>In order to describe the attributes and objects clearly, they are defined as follows:</p>
      <p>Definition 1. (Data set) Generally, a data set D can be considered as a pair, the combination of A and O:</p>
      <p>D = (A, O)</p>
      <p>where A is the set of all attributes (dimensions):</p>
      <p>A = {a1, a2, ..., ai, ...} (1)</p>
      <p>and O is the set of all objects:</p>
      <p>O = {o1, o2, ..., op, ...} (2)</p>
      <p>where op is an object with its values on A:</p>
      <p>op = {op,a1, op,a2, ..., op,ai, ...} (3)</p>
      <p>We denote the values of all objects on attribute ai with:</p>
      <p>oai = {o1,ai, o2,ai, ..., op,ai, ...} (4)</p>
      <p>Definition 2. (Subspace cluster) A subspace cluster S is also a data set and is defined as follows:</p>
      <p>S = (Ã, Õ) (5)</p>
      <p>where Ã ⊆ A and Õ ⊆ O, and S must satisfy a particular condition C, which is defined differently in every subspace clustering algorithm.</p>
      <p>A subspace cluster S can then be written like this:</p>
      <p>S = (Ã, Õ) = ({a1, a12, a60, ...}, {o1, o5, o30, ...})</p>
      <p>The cardinalities of S regarding objects and dimensions are defined respectively as:</p>
      <p>|S|O = |Õ|, |S|A = |Ã| (6)</p>
      <p>Remark 1. Suppose S1 = (A1, O1) and S2 = (A2, O2) are two subspace clusters. Then:</p>
      <p>If A1 ≠ A2 or O1 ≠ O2, then S1 ≠ S2: subspace clusters with different dimensions or objects are considered different ones.</p>
      <p>For every A0 ⊆ A1, S0 = (A0, O1) is also a subspace cluster.</p>
      <p>If A1 ⊂ A2 and O1 = O2, or A1 = A2 and O1 ⊂ O2, then S1 &lt; S2. Only the largest subspace cluster is kept in the clustering result.</p>
      <p>Definition 3. The intersection of two subspace clusters is defined as follows:</p>
      <p>S1 ∩ S2 = (A1 ∪ A2, O1 ∩ O2) (7)</p>
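      <p>A minimal sketch (with hypothetical attribute and object labels) illustrates Definition 3: the dimension sets are unioned while the object sets are intersected.</p>

```python
# Definition 3 on two toy subspace clusters, each represented as a
# (dimensions, objects) pair of frozensets.
def intersect(s1, s2):
    a1, o1 = s1
    a2, o2 = s2
    return (a1 | a2, o1 & o2)  # (A1 union A2, O1 intersection O2)

S1 = (frozenset({"a1"}), frozenset({"o1", "o5", "o30"}))
S2 = (frozenset({"a2"}), frozenset({"o5", "o7", "o30"}))
# the result covers both dimensions and the shared objects o5 and o30
result = intersect(S1, S2)
```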
      <p>Definition 4. Sai denotes the set of all subspace clusters found in dimension ai, and SD denotes the set of all subspace clusters found in D; finding SD is the task of subspace clustering.</p>
    </sec>
    <sec id="sec-6">
      <title>2.2 Gravitation</title>
      <p>Gravitation is a natural phenomenon describing the force of attraction between objects with mass; it is important because it influences our everyday lives. Newton's law of universal gravitation is defined as follows:</p>
      <p>F = G m1 m2 / r^2 (8)</p>
      <p>where F is the gravitational force between two point masses, G is the gravitational constant, m1 and m2 are the masses of the two points, and r is the distance between the two point masses.</p>
      <p>SUGRA uses the gravitation function as a measure between objects. To make the calculation easier, a simple gravity function is used here:</p>
      <p>Definition 5. (Simple gravity function)</p>
      <p>G_pq^ai = m_p m_q / r_pq^2 (9)</p>
      <p>Suppose that every single object op has mass mp = 1. The distance rpq between op and oq is defined as</p>
      <p>r_pq = l_pq / (L / (N - 1)) (10)</p>
      <p>where L = max(oai) - min(oai) is the length of this dimension and N = |O| is the number of objects. rpq is thus the proportion of the real distance lpq to the average interval L/(N-1).</p>
      <p>Remark 2. If rpq = 0, op and oq stand at the same place in ai. To keep G_pq^ai calculable and to get a logical result, we set rpq greater than 0 but smaller than any other distance. One idea is to set rpq to half of the minimum positive distance in ai: if om, on have the minimum distance rmn &gt; 0 in ai, then setting rpq = rmn/2 makes sure that G_pq^ai &gt; G_mn^ai, which is the expected behavior.</p>
      <p>Remark 3. rpq is defined as a proportional distance rather than a real distance because this brings data with different value ranges onto the same scale. For example, the attributes age and salary obviously have two different ranges, but with such a proportional distance the two attributes can be calculated and compared together. The following definitions are also important for the SUGRA algorithm.</p>
      <p>Definition 6. (Gravitation of an object) The gravitation of an object op in dimension ai is defined as the sum of the gravitation between op and all other objects:</p>
      <p>G_p^ai = Σ_{q, q≠p} G_pq^ai (11)</p>
      <p>An object that lies near others (a cluster object) has a greater gravitation than an object far from others (a non-cluster object) (see Figure 1 (b)).</p>
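      <p>The gravitation of Definitions 5 and 6 can be sketched in a few lines of NumPy. This is a hypothetical helper, not the authors' implementation; it assumes a dimension with at least two distinct values.</p>

```python
import numpy as np

def gravitation(values):
    """Gravitation G_p of every object in one dimension (Definitions 5, 6).

    All masses are 1, distances are scaled by the average interval
    L/(N-1), and a zero distance is replaced by half of the minimum
    positive distance (Remark 2).
    """
    o = np.asarray(values, dtype=float)
    N = len(o)
    L = o.max() - o.min()
    r = np.abs(o[:, None] - o[None, :]) / (L / (N - 1))  # scaled r_pq
    r[r == 0] = r[r > 0].min() / 2                       # Remark 2
    g = 1.0 / r**2                                       # G_pq = 1 / r_pq^2
    np.fill_diagonal(g, 0.0)                             # exclude q = p
    return g.sum(axis=1)                                 # G_p = sum over q
```

      <p>On uniformly spaced objects the middle object's gravitation approaches π²/3 ≈ 3.29, matching the average gravitation of Definition 7, while edge objects get smaller values.</p>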
      <p>Definition 7. (Average gravitation) The average gravitation Ḡ^ai of dimension ai is the gravitation of the middle object of uniformly distributed objects in ai. Ḡ^ai is drawn as a dotted line in Figure 1.</p>
      <p>Suppose om is the object in the middle. Ḡ^ai can then be calculated as follows:</p>
      <p>Ḡ^ai = Σ_{p, p≠m} L^2 / (l_mp^2 (N-1)^2) = 2 Σ_{n=1}^{(N-1)/2} 1/n^2 → 2 · π^2/6 = π^2/3 ≈ 3.29 (for N → ∞)</p>
      <p>[Figure 1: Gravitation of each object (solid) and the average gravitation (dotted line) for objects 0-20 in (a) a uniform distribution and (b) a distribution with clusters.]</p>
      <p>Remark 4. The gravitation of an object as defined in (11) has the following property in one dimension: an object in the middle has a greater gravitation value than one at the edge, which can be seen clearly when the objects are distributed uniformly (see Figure 1 (a)).</p>
      <p>Remark 5. The gravitation values of cluster objects and non-cluster objects differ greatly: non-cluster objects usually have very small gravitation values, while cluster objects have larger values. This property is very important for the clustering process.</p>
      <p>If a data set has many objects, i.e. N is large, then Ḡ^ai ≈ 3.29. From experiments we found that using the average gravitation Ḡ^ai as a threshold to separate cluster objects from non-cluster objects returns good results. This is not a silver bullet, but it is a good starting point: the threshold can be adjusted around this value.</p>
      <p>Figure 2 shows an example of SUGRA on two-dimensional data. The value 3.29 separates the gravitation in each of the two dimensions, where the red points illustrate the gravitation of random points.</p>
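      <p>The limit in Definition 7 is easy to verify numerically; the following small check (illustrative only) evaluates the partial sums 2·Σ 1/n² and compares them with π²/3.</p>

```python
import math

def avg_gravitation(N):
    """Average gravitation of the middle of N uniformly spaced objects:
    2 * sum_{n=1}^{(N-1)/2} 1/n^2 (see Definition 7)."""
    half = (N - 1) // 2
    return 2 * sum(1.0 / n**2 for n in range(1, half + 1))

limit = math.pi**2 / 3  # the N -> infinity limit, about 3.29
```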
    </sec>
    <sec id="sec-7">
      <title>2.3 Algorithm of SUGRA</title>
      <p>This algorithm consists of two steps: 1. data selection (clustering in one-dimensional spaces) and 2. dimension selection (clustering in high-dimensional spaces). As a bottom-up algorithm, SUGRA first handles the data in one-dimensional spaces, because one-dimensional data can be dealt with easily. Finding clusters in high-dimensional spaces is then based on the clusters found in one dimension.</p>
      <sec id="sec-7-1">
        <title>2.3.1 Data selection</title>
        <p>Algorithm 1: Data selection</p>
        <p>Input: D = (A, O). Output: Sai.
1  foreach ai ∈ A do
2    sort oai in ascending order
3    initialize t = 1
4    foreach op,ai ∈ oai do
5      if G_p^ai &gt; Ḡ^ai then
6        if o(p-1),ai ∈ St,ai and |op,ai - o(p-1),ai| &lt; L/(N-1) then
7          add op,ai to St,ai
8        else
9          t++
10         add op,ai to St,ai
11       end
12     end
13   end
14 end</p>
        <p>As discussed above, the clusters are first selected in each dimension through the gravitation. First of all, oai is sorted in ascending order. If G_p^ai has a greater value than the threshold Ḡ^ai, op,ai is chosen as a cluster candidate. If its neighbor o(p-1),ai is also a cluster candidate and their distance is smaller than the average distance, they are considered to belong to one cluster; otherwise op,ai starts a new cluster. The process stops when no more new clusters are found in ai, and it is the same for the other dimensions. Algorithm 1 shows the details of the data selection.</p>
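        <p>Algorithm 1 can be sketched as follows. This is a hypothetical, self-contained helper, not the authors' code: the threshold defaults to the average gravitation ≈ 3.29, and the merge distance is the average interval L/(N-1).</p>

```python
import numpy as np

def data_selection(values, threshold=3.29):
    """Sketch of Algorithm 1: one-dimensional clusters by gravitation."""
    o = np.asarray(values, dtype=float)
    N = len(o)
    avg = (o.max() - o.min()) / (N - 1)      # average interval L/(N-1)
    # gravitation of every object (Definition 6) with scaled distances
    r = np.abs(o[:, None] - o[None, :]) / avg
    r[r == 0] = r[r > 0].min() / 2           # Remark 2
    g = 1.0 / r**2
    np.fill_diagonal(g, 0.0)
    g = g.sum(axis=1)
    clusters, current, prev = [], [], None
    for idx in np.argsort(o):                # visit o_ai in ascending order
        if g[idx] > threshold:               # cluster candidate
            if prev is not None and avg > o[idx] - o[prev]:
                current.append(idx)          # close to previous candidate
            else:
                if current:
                    clusters.append(current)
                current = [idx]              # open a new cluster
            prev = idx
    if current:
        clusters.append(current)
    return clusters
```

        <p>For example, two tight groups with a lone point in between yield two clusters, and the lone point falls below the threshold and is left out as noise.</p>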
        <p>After the data selection process we obtain subspace clustering results Sai in every one-dimensional space; a subspace cluster St ∈ Sai has the form</p>
        <p>St = ({ai}, {o1, o5, o9, ...}) (12)</p>
      </sec>
      <sec id="sec-7-2">
        <title>2.3.2 Dimension selection</title>
        <p>Algorithm 2: Dimension selection</p>
        <p>Input: Sai. Output: SD.
1 add all St ∈ Sai to SD
2 foreach S ∈ SD do
3   while there is an S' ∈ SD with S ≠ S' do
4     if |S ∩ S'|O ≥ 2 then
5       add S ∩ S' to SD
6     end
7   end
8 end
9 return SD</p>
        <p>In the data selection, one-dimensional clusters of the form (12) were found; finding subspace clusters in higher dimensions is based on the intersection defined in (7). For two subspace clusters S1 and S2, if |S1 ∩ S2|O ≥ 2, then S1 ∩ S2 is a new subspace cluster.</p>
        <p>Every combination of clusters is checked through the intersection; this process stops when no more new clusters are found. The final result keeps only the largest subspace clusters. The detailed algorithm is shown in Algorithm 2.</p>
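        <p>Algorithm 2 can be sketched with clusters represented as (dimensions, objects) pairs of frozensets. Again this is a hypothetical helper, including a simple pruning of non-maximal clusters in the spirit of Remark 1.</p>

```python
def dimension_selection(one_dim_clusters, min_objects=2):
    """Sketch of Algorithm 2: grow subspace clusters by intersection."""
    result = set(one_dim_clusters)           # each cluster: (dims, objects)
    grown = True
    while grown:                             # stop when no new cluster appears
        grown = False
        for s1 in list(result):
            for s2 in list(result):
                if s1 == s2:
                    continue
                # Definition 3: (A1 union A2, O1 intersection O2)
                cand = (s1[0] | s2[0], s1[1] & s2[1])
                if len(cand[1]) >= min_objects and cand not in result:
                    result.add(cand)
                    grown = True
    # keep only the largest clusters: drop any cluster whose dimensions
    # and objects are both contained in another cluster's
    return {s for s in result
            if not any(t != s and s[0].issubset(t[0]) and s[1].issubset(t[1])
                       for t in result)}
```

        <p>For instance, two one-dimensional clusters sharing at least two objects produce a two-dimensional cluster on exactly those shared objects.</p>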
      </sec>
    </sec>
    <sec id="sec-8">
      <title>2.4 Further discussions</title>
      <p>Choosing parameters is usually difficult for a subspace clustering algorithm: a small deviation may cause a different result. The boundaries between cluster objects and non-cluster objects are especially indistinct, and they can hardly be recognized. SUGRA uses the gravitation function, which marks cluster and non-cluster objects with great differences and therefore makes the parameter decision easy.</p>
      <p>Data selection. Experimental experience shows that Ḡ^ai separates cluster objects from non-cluster objects very well in the data selection. As derived in Definition 7, Ḡ^ai ≈ 3.29 does not depend on |O| and can be used as the threshold in almost all situations. If the results are not satisfying, the threshold can be set a little smaller or greater.</p>
      <p>Dimension selection. The condition |S1 ∩ S2|O ≥ 2 is used in the dimension selection to decide whether S1 ∩ S2 is a subspace cluster. The condition indicates that an object group with at least two objects is taken as a new cluster. This setting has a high precision, because not only big clusters but also small clusters can be found, but it naturally takes more time. In contrast, choosing a greater number may save time but lose some interesting small clusters.</p>
      <p>In the dimension selection, every possible combination of subspace clusters may be examined, so the worst-case run time of the dimension selection is O(2^m), where m is the number of one-dimensional subspace clusters found in the data selection.</p>
    </sec>
    <sec id="sec-9">
      <title>3. CONCLUSIONS</title>
      <p>Subspace clustering is able to discover clusters and extract their features from subspaces of high-dimensional data, which is commonly gathered in many fields. Most familiar subspace clustering approaches have problems with determining their parameters. We apply the gravitation function to subspace clustering in order to find a new method that makes the determination of the parameters easier. The method, named SUGRA, belongs to the bottom-up algorithms: first it finds clusters using the gravitation function in one-dimensional spaces, then it combines these clusters in higher dimensions to search for high-dimensional subspace clusters.</p>
      <p>In SUGRA, non-cluster objects always have low gravitation values (&lt;3.29), while cluster objects have very large values, which depend on the clusters' density and number of objects. The value 3.29 does not depend on the number of objects, so it can separate out the non-cluster objects in order to obtain the cluster objects. We do not have to choose parameters as in other algorithms; with this threshold SUGRA gets an almost satisfying result from the start. Future research will focus on optimizing the gravitation function and the algorithm in order to improve the subspace clustering results. The gravitation technique should be used not only on one-dimensional data but also directly in multiple dimensions. Another task is to let SUGRA adapt to various data types, such as categorical data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Procopiuc</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <article-title>Fast algorithms for projected clustering</article-title>
          .
          <source>In Proceedings of the 1999 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          , Philadelphia, Pennsylvania, United States, May 31-June 03
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Finding generalized projected clusters in high dimensional spaces</article-title>
          .
          <source>In Proceedings of the 2000 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>81</lpage>
          , Dallas, Texas, United States, May 15-18
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          .
          <article-title>Automatic subspace clustering of high dimensional data for data mining applications</article-title>
          .
          <source>In Proceedings of the 1998 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>94</fpage>
          -
          <lpage>105</lpage>
          , Seattle, Washington, United States, June 01-04
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Fu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Entropy-based subspace clustering for mining numerical data</article-title>
          .
          <source>In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>84</fpage>
          -
          <lpage>93</lpage>
          , San Diego, California, United States, August 15-18
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Meulman</surname>
          </string-name>
          .
          <article-title>Clustering objects on subsets of attributes</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Goil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nagesh</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          .
          <article-title>MAFIA: Efficient and scalable subspace clustering for very large data sets</article-title>
          .
          <source>Technical Report CPDC-TR-9906-010</source>
          , Northwestern University,
          <year>June 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          .
          <article-title>A new gravitational clustering algorithm</article-title>
          .
          <source>In Proc. of the SIAM Int. Conf. on Data Mining (SDM)</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Gabriel</surname>
          </string-name>
          .
          <article-title>DENCLUE 2.0: Fast clustering based on kernel density estimation</article-title>
          .
          <source>In Proc. of International Symposium on Intelligent Data Analysis 2007 (IDA'07)</source>
          , Ljubljana, Slowenien,
          <year>2007</year>
          . LNAI Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          .
          <article-title>An efficient approach to clustering in multimedia databases with noise</article-title>
          .
          <source>In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining</source>
          , New York,
          <year>1998</year>
          . AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.-S.</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <article-title>A new cell-based clustering method for large, high-dimensional data in data mining applications</article-title>
          .
          <source>In Proceedings of the 2002 ACM symposium on Applied computing</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>14</lpage>
          , Madrid, Spain,
          <year>March 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Clustering through decision tree construction</article-title>
          .
          <source>In Proceedings of the ninth international conference on Information and knowledge management</source>
          , pages
          <fpage>20</fpage>
          -
          <lpage>29</lpage>
          , McLean, Virginia, United States, November 06-11
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Parsons</surname>
          </string-name>
          , E. Haque, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Subspace clustering for high dimensional data: A review</article-title>
          .
          <source>Sigkdd Explorations</source>
          ,
          <volume>6</volume>
          :
          <fpage>90</fpage>
          -
          <fpage>105</fpage>
          ,
          <year>June 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Procopiuc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Murali</surname>
          </string-name>
          .
          <article-title>A monte carlo algorithm for fast projective clustering</article-title>
          .
          <source>In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, Madison</source>
          , Wisconsin, June 03-06
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>A shrinking-based approach for multi-dimensional data analysis</article-title>
          .
          <source>In VLDB '2003: Proceedings of the 29th international conference on Very large data bases</source>
          , pages
          <volume>440</volume>
          -
          <fpage>451</fpage>
          . VLDB Endowment,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.-G.</given-names>
            <surname>Woo</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>FINDIT: a Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting</article-title>
          .
          <source>PhD thesis</source>
          , Korea Advanced Institute of Science and Technology, Taejon, Korea,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>δ-clusters: Capturing subspace correlation in a large data set</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Data Engineering</source>
          , page
          <volume>517</volume>
          , February 26-March 01
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <source>Automatische Parameterbestimmung durch Gravitation in Subspace Clustering. 21. Workshop "Grundlagen von Datenbanken"</source>
          ,
          <source>Rostock</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>