Integrating Fuzzy c-Means Clustering with PostgreSQL ∗ ⃝ c Ruslan Miniakhmetov South Ural State University tavein@gmail.com M.Sc. advisor Mikhail Zymbler Abstract • n ∈ N — cardinal number of training set; Many data sets to be clustered are stored in rela- • X ⊂ Rd — training set for data vectors; tional databases. Having a clusterization algo- • i ∈ N : 1 ⩽ i ⩽ n — vector subscript in a training rithm implemented in SQL provides easier clus- set; terization inside a relational DBMS than out- side with some alternative tools. In this pa- • xi ∈ X — the i-th vector in the sample; per we propose Fuzzy c-Means clustering algo- • k ∈ N — number of clusters; rithm adapted for PostgreSQL open-source re- lational DBMS. • j ∈ N : 1 ⩽ j ⩽ k — cluster number; • C ⊂ Rk×d — matrix with clusters’ centers (cen- 1 Introduction troids); Integrating clustering algorithms is a topic/xplore/al is- • cj ∈ Rd — center of cluster j, d-dimensional vec- sue for database programmers [11]. Such an approach, tor; on the one hand, encapsulates DBMS internal details from application programmer. On the other hand, it al- • xil , cjl ∈ R — l-s coordinates of vectors xi and cj lows to avoid overhead connected with export data out- respectively; side a relational DBMS. The Fuzzy c-Means (FCM) [9, 6, 2] clustering algorithm provides a fuzzy clustering of • U ⊂ Rn×k — matrix with membership degrees, data. Currently this algorithm have many implementa- where uij ∈ R: 0 ⩽ uij ⩽ 1 is a membership tions on a high-level programming languages [5, 7]. For degree between vector xi and cluster j; implementation the FCM algorithm in SQL we choose • ρ(xi , cj ) — distance function, defines a member- an open-source PostgreSQL DBMS [15]. ship degree between vector xi and cluster j; The paper is organized as follows. Section 2 intro- duces basic definitions and an overview of the FCM al- • m ∈ R : m > 1 — the fuzzyfication degree of ob- gorithm. Section 3 proposes implementation of the FCM jective function; in SQL called pgFCM. Section 4 briefly discusses related work. Section 5 contains conclusion remarks and direc- • JF CM — objective function. tions for future work. The FCM is based on minimization of the objective 2 The Fuzzy c-Means Algorithm function JF CM : K-Means [10] is one of the most popular clustering algo- ∑ N ∑ k rithms, it is a simple and fairly fast [3]. The FCM algo- JF CM (X, k, m) = um 2 ij ρ (xi , cj ) (1) rithm generalizes K-Means to provide fuzzy clustering, i=1 j=1 where data vectors can belong to several partitions (clus- ters) at the same time with a given weight (membership Fuzzy clusterization is carried out through an iterative degree). To describe FCM we use the following notation: optimization of the objective function (1). Membership matrix U and centroids cij are updated using the follow- • d ∈ N — dimensionality of a data vectors (or data ing formulas: items) space; k ( ∑ ) 2 ρ(xi , cj ) 1−m uij = (2) • l ∈ N : 1 ⩽ l ⩽ d — subscript of the vector’s coor- ρ(xi , ct ) t=1 dinate; ∑ n ∗ This paper is supported by the Russian Foundation for Basic Re- ij · xil um i=1 search (grant No. 09-07-00241-a). ∀j, l cjl = ∑ n (3) Proceedings of the Spring Researcher’s Colloquium on Database um ij and Information Systems, Moscow, Russia, 2011 i=1 Table 2: Relational Tables of pgFCM Algorithm No. Table Semantics Columns Number of rows 1 SH training set for data vectors (horizontal form) i, x1, x2, . . . , xd n 2 SV training set for data vectors (vertical form) i, l, val n·d 3 C centroids’ coordinates j, l, val k·d 4 SD distances between xi and cj i, j, dist n·k 5 U degree of membership vector xi to a cluster cj on step s i, j, val n·k 6 UT degree of membership vector xi to a cluster cj on step s+1 i, j, val n·k 7 P result of computation function δ (6) on step s d, k, n, s, delta s (s) (s+1) Let s is a number of iteration, uij and uij are el- Table 1: Data Elements Subscripts ements of matrix U on steps s and s+1 respectively, and ε ∈ (0, 1) ⊂ R is a termination criterion. Then the ter- Subscript Range Semantics mination condition can be written as follows: i 1, n vector subscript (s+1) (s) j 1, k cluster subscript max{|uij − uij |} < ε (4) ij l 1, d vector’s coordinate subscript Objective function (1) converges to a local minimum (or To compute the termination criterion 4 we introduce a saddle point) [1]. the function δ as follows: (s+1) (s) Algorithm 1 The Fuzzy c-Means Algorithm δ = max{|uij − uij |} (6) ij Input: X, k, m, ε Output: U 3.2 Database Scheme 1: s := 0, U (0) := (uij ) {initialization} Table 2 summarizes database scheme of pgFCM algo- 2: repeat rithm (underlined columns are primary keys). 3: {computation of new centroids’ coordinates} In order to store sample of a data vectors from set X it Compute C (s) := (cj ) using formula (3) is necessary to define table SH(i, x1, x2, . . . , xd). Each where uij ∈ U (s) row of sample stores vector of data with dimension d and 4: {update matrixes values} subscript i. Table SH has n rows and column i as a Compute U (s) and U (s+1) using formula (2) primary key. 5: s := s + 1 FCM steps demand aggregation of vector coordinates (s) (s−1) 6: until max{|uij − uij |} ⩾ ε (sum, maximum, etc.) from set X. However, because ij of its definition, table SH does not allow using SQL ag- gregation functions. To avoid this obstacle we define a Algorithm 1 shows the basic FCM. The input table SV (i, l, val), which contains n·d rows and have of algorithm receives a set of data vectors X = a composite primary key (i, l). Table SV represents a (x1 , x2 , . . . , xn ), number of clusters k, fuzzyfication de- data sample from table SH ans supports SQL aggrega- gree m, and termination criterion ε. The output is a ma- tion functions max and sum. trix of membership degrees U . Due to store coordinates of cluster centroids tempo- rary table C(j, l, val) is defined. Table C has k·d rows 3 Implementation of Fuzzy c-Means Algo- and the composite primary key (j, l). Like the table SV , rithm using PostgreSQL structure of table C allows to use aggregation functions. In order to store distances ρ(xi , cj ) ta- In this section we suggest pgFCM algorithm as a way to ble SD(i, j, dist) is used. This table has n·k rows integrate FCM algorithm with PostgreSQL DBMS. and the composite primary key (i, j). Table U (i, j, val) stores membership degrees, calcu- 3.1 General Definitions lated on s-th step. To store membership degrees on To integrate FCM algorithm with a relational DBMS it is s+1 step similar table U T (i, j, val) is used. Both tables necessary to perform matrixes U and X as relational ta- have n·k rows and the composite primary key (i, j). bles. Subscripts for identification elements of relational Finally, table P (d, k, n, s, delta) stores iteration num- tables are presented in Table 1 (numbers n, k, d a defined ber s and the result of computation function (6) delta for above in a section 2). this iteration number. Number of rows in table P de- As a function of distance ρ(xi , cj ), without loss of pends on the number of iterations. generality, we use the Euclidean metric: 3.3 The pgFCM Algorithm v u d The pgFCM algorithm is implemented by means of a u∑ ρ(xi , cj ) = t (xil − cjl )2 (5) stored function in PL/pgSQL language. Algorithm 2 l=1 shows the main steps of the pgFCM. Algorithm 2 The pgFCM Algorithm C1: INSERT INTO C Input: SH, k, m, eps SELECT R1.j, l, R1.s1 / R2.s2 AS val Output: U FROM (SELECT l, j, 1: {initialization} sum(U.val^2 * SV.val) Create and initialize temporary tables (U, P, SV , AS s1 etc.) FROM U, SV 2: repeat WHERE U.i = SV.i 3: {computations} GROUP BY l, j) AS R1, 4: Compute centroids coordinates. Update table C. (SELECT j, sum(U.val^2) AS s2 5: Compute distances ρ(xi , cj ). Update table SD. FROM U 6: Compute membership degrees U T = (utij ). GROUP BY j) AS R2 Update table U T . WHERE R1.j = R2.j; 7: {update} Update tables P and U . C2: INSERT INTO SD 8: {check for termination} SELECT i, j, 9: until P.delta ⩾ eps sum((SV.val - C.val)^2) AS dist The input set of data vectors X stored in table SH. FROM SV, C Fuzzyfication degree m, termination criterion ε, and WHERE SV.l = C.l; number of clusters k are function parameters. The ta- GROUP BY i, j; ble U contains a result of pgFCM work. Through the FCM, computations of the distances pro- 3.4 Initialization vide by formula (2). Let us notice that in formula (3) the fraction’s numerator does not depend on t, and rewrite Initialization of tables SV , U , and P provided by SQL- this formula as follows: code I1, I2, and I3 respectively. Table SV is formed by sampling records from the table SH. 2 ∑ k 2 uij = ρ 1−m (xi , cj ) · ρ m−1 (xi , ct ) (7) I1: INSERT INTO SV t=1 SELECT SH.i, 1, x1 FROM SH; ... So, the computation of membership degrees can be INSERT INTO SV written as follows: SELECT SH.i, d, xd FROM SH; C3: INSERT INTO UT For table U a membership degree between data vec- SELECT i, j, tor xi and cluster j takes 1 for all i = j. SD.dist^(2.0^(1.0-m)) I2: INSERT INTO U (i, j, val) * SD1.den AS val VALUES (1, 1, 0); FROM (SELECT i, ... sum(dist^(2.0^(m-1.0))) INSERT INTO U (i, j, val) AS den VALUES (j, j, 1); FROM SD ... GROUP BY i) AS SD1, SD INSERT INTO U (i, j, val) WHERE SD.i = SD1.i; VALUES (n, k, 0); In other words, as a start coordinates of centroids, 3.6 Update first d data vectors from sample X are used. Update stage of the pgFCM modifies P and U tables as ∀i=j uij = 1 ⇒ cj = xi shown below in U1 and U2 respectively. When initializing the table P , the number of points k U1: INSERT INTO P is taken as a parameter of the function pgF CM . A data SELECT L.d, L.k, L.n, L.s + 1 AS s, vectors space dimensionality d and a cardinal number of E.delta the training set n also provided by function pgF CM pa- FROM (SELECT i, j, rameters. The iteration number s and delta initializes as max(UT.val - U.val) zeros. AS delta I3: INSERT INTO P(d, k, n, s, delta) FROM U, UT VALUES (d, k, n, 0, 0); GROUP BY i, j) AS E, (SELECT d, k, n, max(s) 3.5 Computations FROM P GROUP BY d, k, n) AS L) AS R According to Algorithm 2, the computation stage is split- ted to the following three sub-steps: computation coor- Table U T stores temporary membership degrees to be dinates of centroids, computation of distances, and com- inserted into table U . To provide the rapid removal all the putation membership degrees, marked as C1, C2, and C3 table U rows, obtained at the previous iteration, we use respectively. the truncate operator. U2: TRUNCATE U; References INSERT INTO U [1] J. Bezdek, R. Hathaway, M. Sobin, and W. Tucker. SELECT * FROM UT; Convergence Theory for Fuzzy c-means: Coun- terexamples and Repairs. IEEE Trans. Syst. Man 3.7 Check Cybern., 17:873–877, October 1987. This stage is the final for the algorithm pgFCM. On each [2] J. C. Bezdek. Pattern Recognition with Fuzzy Ob- iteration the termination condition (4) must be checked. jective Function Algorithms. Kluwer Academic To implement the check, the result delta of the func- Publishers, Norwell, MA, USA, 1981. tion (6) from table P is stored in the temporary vari- able tmp. [3] P. S. Bradley, U. M. Fayyad, and C. Reina. Scal- ing Clustering Algorithms to Large Databases. In KDD, pages 9–15, 1998. CH1: SELECT delta INTO tmp FROM P, (SELECT d, k, n, [4] J. Clear, D. Dunn, B. Harvey, M. Heytens, max(s) AS max_s P. Lohman, A. Mehta, M. Melton, L. Rohrberg, FROM P A. Savasere, R. Wehrmeister, and M. Xu. Non- GROUP BY d, k, n) AS L Stop SQL/MX primitives for knowledge discovery. WHERE P.s = L.max_s AND P.d = L.d In Proceedings of the fifth ACM SIGKDD interna- AND P.k = L.k AND P.n = L.n; tional conference on Knowledge discovery and data mining, KDD ’99, pages 425–429, New York, NY, After selecting the delta, we need to check the condi- USA, 1999. ACM. tion δ < ε. Then if this condition is true we should stop, otherwise, work will be continued. [5] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and Weingessel A. Machine Learning Open-Source CH2: IF (tmp < eps) THEN Package ‘r-cran-e1071’, 2010. http://cran.r-project. RETURN; org/web/packages/e1071/index.html. END IF; [6] J. C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well- The final result of the algorithm pgFCM will be stored Separated Clusters. Journal of Cybernetics, 3:32– in table U . 57, 1973. [7] Apache Software Foundation, I. Drost, T. Dunning, 4 Related Work J. Eastman, O. Gospodnetic, G. Ingersoll, J. Man- Research on integrating data mining algorithms with re- nix, S. Owen, and K. Wettin. Apache Mahout, lational DBMS includes the following. Association rules 2010. https://cwiki.apache.org/confluence/display/ mining is explored in [13]. General data mining primi- MAHOUT/Fuzzy+K-Means. tives are suggested in [4]. Primitives for decision trees [8] G. Graefe, U. M. Fayyad, and S. Chaudhuri. On mining are proposed in [8]. the Efficient Gathering of Sufficient Statistics for Our research was inspired by papers [11, 12], where Classification from Large SQL Databases. In KDD, integrating K-Means clustering algorithm with relational pages 204–208, 1998. DBMS, was carried out. The way we exploit is similar to mentioned above. The main contribution of the pa- [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clus- per is an extension of results presented in [11, 12] for tering: a review. ACM Comput. Surv., 31:264–323, the case where data vectors may belong to several clus- September 1999. ters. Such a case is very important in problems connected with medicine data analysis [14, 16]. To the best of our [10] J. B. MacQueen. Some Methods for Classifica- knowledge there are no papers devoted to implementing tion and Analysis of MultiVariate Observations. In fuzzy clustering with relational DBMS. L. M. Le Cam and J. Neyman, editors, Proc. of the fifth Berkeley Symposium on Mathematical Statis- tics and Probability, volume 1, pages 281–297. 5 Conclusion University of California Press, 1967. In this paper we have proposed the pgFCM algorithm. [11] C. Ordonez. Programming the K-means cluster- pgFCM implements Fuzzy c-Means clustering algorithm ing algorithm in SQL. In W. Kim, R. Kohavi, and processes data stored in relational tables using Post- J. Gehrke, and W. DuMouchel, editors, KDD, pages greSQL open-source DBMS. There are following issues 823–828. ACM, 2004. to continue our research. Firstly, we plan to investigate pgFCM scalability using both synthetical and real data [12] C. Ordonez. Integrating K-Means Clustering with a sets. The second direction of our research is develop- Relational DBMS Using SQL. IEEE Trans. Knowl. ing a parallel version of pgFCM for distribution memory Data Eng., 18(2):188–201, 2006. multiprocessors. [13] S. Sarawagi, S. Thomas, and R. Agrawal. Integrat- ing association rule mining with relational database systems: alternatives and implications. In Pro- ceedings of the 1998 ACM SIGMOD international conference on Management of data, SIGMOD ’98, pages 343–354, New York, NY, USA, 1998. ACM. [14] A. I. Shihab. Fuzzy Clustering Algorithms and their Applications to Medical Image Analysis. PhD the- sis, University of London, 2000. [15] M. Stonebraker, L. A. Rowe, and M. Hirohama. The Implementation of POSTGRES. IEEE Trans. on Knowl. and Data Eng., 2:125–142, March 1990. [16] D. Zhang and S. Chen. A Novel Kernelized Fuzzy c-Means Algorithm with Application in Medical Image Segmentation. Artificial Intelligence in Medicine, 32:37–50, 2004.