Integrating Fuzzy c-Means Clustering with PostgreSQL ∗

                                                     ⃝
                                                     c Ruslan Miniakhmetov
                                                      South Ural State University
                                                          tavein@gmail.com

                                                    M.Sc. advisor Mikhail Zymbler


                           Abstract                                       • n ∈ N — cardinal number of training set;

     Many data sets to be clustered are stored in rela-                   • X ⊂ Rd — training set for data vectors;
     tional databases. Having a clusterization algo-                      • i ∈ N : 1 ⩽ i ⩽ n — vector subscript in a training
     rithm implemented in SQL provides easier clus-                         set;
     terization inside a relational DBMS than out-
     side with some alternative tools. In this pa-                        • xi ∈ X — the i-th vector in the sample;
     per we propose Fuzzy c-Means clustering algo-
                                                                          • k ∈ N — number of clusters;
     rithm adapted for PostgreSQL open-source re-
     lational DBMS.                                                       • j ∈ N : 1 ⩽ j ⩽ k — cluster number;
                                                                          • C ⊂ Rk×d — matrix with clusters’ centers (cen-
1 Introduction                                                              troids);
Integrating clustering algorithms is a topic/xplore/al is-
                                                                          • cj ∈ Rd — center of cluster j, d-dimensional vec-
sue for database programmers [11]. Such an approach,
                                                                            tor;
on the one hand, encapsulates DBMS internal details
from application programmer. On the other hand, it al-                    • xil , cjl ∈ R — l-s coordinates of vectors xi and cj
lows to avoid overhead connected with export data out-                      respectively;
side a relational DBMS. The Fuzzy c-Means (FCM) [9,
6, 2] clustering algorithm provides a fuzzy clustering of                 • U ⊂ Rn×k — matrix with membership degrees,
data. Currently this algorithm have many implementa-                        where uij ∈ R: 0 ⩽ uij ⩽ 1 is a membership
tions on a high-level programming languages [5, 7]. For                     degree between vector xi and cluster j;
implementation the FCM algorithm in SQL we choose
                                                                          • ρ(xi , cj ) — distance function, defines a member-
an open-source PostgreSQL DBMS [15].
                                                                            ship degree between vector xi and cluster j;
   The paper is organized as follows. Section 2 intro-
duces basic definitions and an overview of the FCM al-                    • m ∈ R : m > 1 — the fuzzyfication degree of ob-
gorithm. Section 3 proposes implementation of the FCM                       jective function;
in SQL called pgFCM. Section 4 briefly discusses related
work. Section 5 contains conclusion remarks and direc-                    • JF CM — objective function.
tions for future work.
                                                                           The FCM is based on minimization of the objective
2 The Fuzzy c-Means Algorithm                                           function JF CM :
K-Means [10] is one of the most popular clustering algo-                                              ∑
                                                                                                      N ∑
                                                                                                        k
rithms, it is a simple and fairly fast [3]. The FCM algo-                      JF CM (X, k, m) =                    um   2
                                                                                                                     ij ρ (xi , cj )   (1)
rithm generalizes K-Means to provide fuzzy clustering,                                                i=1 j=1
where data vectors can belong to several partitions (clus-
ters) at the same time with a given weight (membership                     Fuzzy clusterization is carried out through an iterative
degree). To describe FCM we use the following notation:                 optimization of the objective function (1). Membership
                                                                        matrix U and centroids cij are updated using the follow-
  • d ∈ N — dimensionality of a data vectors (or data                   ing formulas:
    items) space;                                                                               k (
                                                                                                ∑              )        2
                                                                                                    ρ(xi , cj ) 1−m
                                                                                      uij =                                            (2)
  • l ∈ N : 1 ⩽ l ⩽ d — subscript of the vector’s coor-                                               ρ(xi , ct )
                                                                                                t=1
    dinate;
                                                                                                         ∑
                                                                                                         n
    ∗ This paper is supported by the Russian Foundation for Basic Re-                                           ij · xil
                                                                                                               um
                                                                                                         i=1
search (grant No. 09-07-00241-a).                                                       ∀j, l    cjl =     ∑
                                                                                                           n                           (3)
Proceedings of the Spring Researcher’s Colloquium on Database                                                    um
                                                                                                                  ij
and Information Systems, Moscow, Russia, 2011                                                              i=1
                                        Table 2: Relational Tables of pgFCM Algorithm

     No. Table                                    Semantics                                Columns             Number of rows
      1     SH training set for data vectors (horizontal form)                         i, x1, x2, . . . , xd           n
      2     SV       training set for data vectors (vertical form)                          i, l, val                 n·d
      3      C       centroids’ coordinates                                                 j, l, val                 k·d
      4     SD distances between xi and cj                                                  i, j, dist                n·k
      5      U       degree of membership vector xi to a cluster cj on step s               i, j, val                 n·k
      6     UT       degree of membership vector xi to a cluster cj on step s+1             i, j, val                 n·k
      7      P       result of computation function δ (6) on step s                     d, k, n, s, delta              s
                                            (s)     (s+1)
   Let s is a number of iteration, uij and uij   are el-
                                                                                 Table 1: Data Elements Subscripts
ements of matrix U on steps s and s+1 respectively, and
ε ∈ (0, 1) ⊂ R is a termination criterion. Then the ter-               Subscript     Range                 Semantics
mination condition can be written as follows:
                                                                          i           1, n       vector subscript
                           (s+1)      (s)                                 j           1, k       cluster subscript
                  max{|uij         − uij |} < ε               (4)
                    ij                                                    l           1, d       vector’s coordinate subscript

Objective function (1) converges to a local minimum (or                 To compute the termination criterion 4 we introduce
a saddle point) [1].                                                 the function δ as follows:
                                                                                                    (s+1)       (s)
Algorithm 1 The Fuzzy c-Means Algorithm                                              δ = max{|uij           − uij |}             (6)
                                                                                            ij
Input: X, k, m, ε
Output: U                                                            3.2 Database Scheme
 1: s := 0, U (0) := (uij ) {initialization}
                                                                     Table 2 summarizes database scheme of pgFCM algo-
 2: repeat
                                                                     rithm (underlined columns are primary keys).
 3:    {computation of new centroids’ coordinates}                       In order to store sample of a data vectors from set X it
       Compute C (s) := (cj ) using formula (3)                      is necessary to define table SH(i, x1, x2, . . . , xd). Each
       where uij ∈ U (s)                                             row of sample stores vector of data with dimension d and
 4:    {update matrixes values}                                      subscript i. Table SH has n rows and column i as a
       Compute U (s) and U (s+1) using formula (2)                   primary key.
 5:    s := s + 1                                                        FCM steps demand aggregation of vector coordinates
                   (s)    (s−1)
 6: until max{|uij − uij        |} ⩾ ε                               (sum, maximum, etc.) from set X. However, because
            ij
                                                                     of its definition, table SH does not allow using SQL ag-
                                                                     gregation functions. To avoid this obstacle we define a
    Algorithm 1 shows the basic FCM. The input
                                                                     table SV (i, l, val), which contains n·d rows and have
of algorithm receives a set of data vectors X =
                                                                     a composite primary key (i, l). Table SV represents a
(x1 , x2 , . . . , xn ), number of clusters k, fuzzyfication de-
                                                                     data sample from table SH ans supports SQL aggrega-
gree m, and termination criterion ε. The output is a ma-
                                                                     tion functions max and sum.
trix of membership degrees U .
                                                                         Due to store coordinates of cluster centroids tempo-
                                                                     rary table C(j, l, val) is defined. Table C has k·d rows
3 Implementation of Fuzzy c-Means Algo-                              and the composite primary key (j, l). Like the table SV ,
  rithm using PostgreSQL                                             structure of table C allows to use aggregation functions.
                                                                         In order to store distances ρ(xi , cj ) ta-
In this section we suggest pgFCM algorithm as a way to               ble SD(i, j, dist) is used. This table has n·k rows
integrate FCM algorithm with PostgreSQL DBMS.
                                                                     and the composite primary key (i, j).
                                                                         Table U (i, j, val) stores membership degrees, calcu-
3.1 General Definitions                                              lated on s-th step. To store membership degrees on
To integrate FCM algorithm with a relational DBMS it is              s+1 step similar table U T (i, j, val) is used. Both tables
necessary to perform matrixes U and X as relational ta-              have n·k rows and the composite primary key (i, j).
bles. Subscripts for identification elements of relational               Finally, table P (d, k, n, s, delta) stores iteration num-
tables are presented in Table 1 (numbers n, k, d a defined           ber s and the result of computation function (6) delta for
above in a section 2).                                               this iteration number. Number of rows in table P de-
   As a function of distance ρ(xi , cj ), without loss of            pends on the number of iterations.
generality, we use the Euclidean metric:
                                                                     3.3 The pgFCM Algorithm
                               v
                               u d                                   The pgFCM algorithm is implemented by means of a
                               u∑
                 ρ(xi , cj ) = t   (xil − cjl )2              (5)    stored function in PL/pgSQL language. Algorithm 2
                                l=1                                  shows the main steps of the pgFCM.
Algorithm 2 The pgFCM Algorithm                              C1: INSERT INTO C
Input: SH, k, m, eps                                               SELECT R1.j, l, R1.s1 / R2.s2 AS val
Output: U                                                          FROM (SELECT l, j,
 1: {initialization}                                                            sum(U.val^2 * SV.val)
    Create and initialize temporary tables (U, P, SV ,                            AS s1
    etc.)                                                                FROM U, SV
 2: repeat                                                               WHERE U.i = SV.i
 3:    {computations}                                                    GROUP BY l, j) AS R1,
 4:    Compute centroids coordinates. Update table C.                   (SELECT j, sum(U.val^2) AS s2
 5:    Compute distances ρ(xi , cj ). Update table SD.                   FROM U
 6:    Compute membership degrees U T = (utij ).                         GROUP BY j) AS R2
       Update table U T .                                          WHERE R1.j = R2.j;
 7:    {update}
       Update tables P and U .                               C2: INSERT INTO SD
 8:    {check for termination}                                     SELECT i, j,
 9: until P.delta ⩾ eps                                                   sum((SV.val - C.val)^2)
                                                                            AS dist
   The input set of data vectors X stored in table SH.             FROM SV, C
Fuzzyfication degree m, termination criterion ε, and               WHERE SV.l = C.l;
number of clusters k are function parameters. The ta-              GROUP BY i, j;
ble U contains a result of pgFCM work.
                                                                Through the FCM, computations of the distances pro-
3.4 Initialization                                           vide by formula (2). Let us notice that in formula (3) the
                                                             fraction’s numerator does not depend on t, and rewrite
Initialization of tables SV , U , and P provided by SQL-     this formula as follows:
code I1, I2, and I3 respectively. Table SV is formed by
sampling records from the table SH.                                             2                ∑
                                                                                                 k
                                                                                                          2
                                                                      uij = ρ 1−m (xi , cj ) ·         ρ m−1 (xi , ct )   (7)
I1: INSERT INTO SV                                                                               t=1
      SELECT SH.i, 1, x1 FROM SH;
    ...                                                         So, the computation of membership degrees can be
    INSERT INTO SV                                           written as follows:
      SELECT SH.i, d, xd FROM SH;
                                                             C3: INSERT INTO UT
   For table U a membership degree between data vec-
                                                                   SELECT i, j,
tor xi and cluster j takes 1 for all i = j.
                                                                          SD.dist^(2.0^(1.0-m))
I2: INSERT INTO U (i, j, val)                                               * SD1.den AS val
      VALUES (1, 1, 0);                                            FROM (SELECT i,
    ...                                                                          sum(dist^(2.0^(m-1.0)))
    INSERT INTO U (i, j, val)                                                      AS den
      VALUES (j, j, 1);                                                  FROM SD
    ...                                                                  GROUP BY i) AS SD1, SD
    INSERT INTO U (i, j, val)                                      WHERE SD.i = SD1.i;
      VALUES (n, k, 0);
    In other words, as a start coordinates of centroids,     3.6 Update
first d data vectors from sample X are used.                 Update stage of the pgFCM modifies P and U tables as
             ∀i=j       uij = 1 ⇒ cj = xi                    shown below in U1 and U2 respectively.

    When initializing the table P , the number of points k   U1: INSERT INTO P
is taken as a parameter of the function pgF CM . A data            SELECT L.d, L.k, L.n, L.s + 1 AS s,
vectors space dimensionality d and a cardinal number of                   E.delta
the training set n also provided by function pgF CM pa-            FROM (SELECT i, j,
rameters. The iteration number s and delta initializes as                       max(UT.val - U.val)
zeros.                                                                            AS delta
I3: INSERT INTO P(d, k, n, s, delta)                                     FROM U, UT
      VALUES (d, k, n, 0, 0);                                            GROUP BY i, j) AS E,
                                                                        (SELECT d, k, n, max(s)
3.5 Computations                                                         FROM P
                                                                         GROUP BY d, k, n) AS L) AS R
According to Algorithm 2, the computation stage is split-
ted to the following three sub-steps: computation coor-         Table U T stores temporary membership degrees to be
dinates of centroids, computation of distances, and com-     inserted into table U . To provide the rapid removal all the
putation membership degrees, marked as C1, C2, and C3        table U rows, obtained at the previous iteration, we use
respectively.                                                the truncate operator.
U2: TRUNCATE U;                                              References
    INSERT INTO U
                                                              [1] J. Bezdek, R. Hathaway, M. Sobin, and W. Tucker.
      SELECT * FROM UT;
                                                                  Convergence Theory for Fuzzy c-means: Coun-
                                                                  terexamples and Repairs. IEEE Trans. Syst. Man
3.7 Check                                                         Cybern., 17:873–877, October 1987.

This stage is the final for the algorithm pgFCM. On each      [2] J. C. Bezdek. Pattern Recognition with Fuzzy Ob-
iteration the termination condition (4) must be checked.          jective Function Algorithms. Kluwer Academic
    To implement the check, the result delta of the func-         Publishers, Norwell, MA, USA, 1981.
tion (6) from table P is stored in the temporary vari-
able tmp.                                                     [3] P. S. Bradley, U. M. Fayyad, and C. Reina. Scal-
                                                                  ing Clustering Algorithms to Large Databases. In
                                                                  KDD, pages 9–15, 1998.
CH1: SELECT delta INTO tmp
     FROM P, (SELECT d, k, n,                                 [4] J. Clear, D. Dunn, B. Harvey, M. Heytens,
                     max(s) AS max_s                              P. Lohman, A. Mehta, M. Melton, L. Rohrberg,
              FROM P                                              A. Savasere, R. Wehrmeister, and M. Xu. Non-
              GROUP BY d, k, n) AS L                              Stop SQL/MX primitives for knowledge discovery.
     WHERE P.s = L.max_s AND P.d = L.d                            In Proceedings of the fifth ACM SIGKDD interna-
     AND P.k = L.k AND P.n = L.n;                                 tional conference on Knowledge discovery and data
                                                                  mining, KDD ’99, pages 425–429, New York, NY,
   After selecting the delta, we need to check the condi-         USA, 1999. ACM.
tion δ < ε. Then if this condition is true we should stop,
otherwise, work will be continued.                            [5] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer,
                                                                  and Weingessel A. Machine Learning Open-Source
CH2: IF (tmp < eps) THEN                                          Package ‘r-cran-e1071’, 2010. http://cran.r-project.
       RETURN;                                                    org/web/packages/e1071/index.html.
     END IF;
                                                              [6] J. C. Dunn. A Fuzzy Relative of the ISODATA
                                                                  Process and Its Use in Detecting Compact Well-
    The final result of the algorithm pgFCM will be stored
                                                                  Separated Clusters. Journal of Cybernetics, 3:32–
in table U .
                                                                  57, 1973.
                                                              [7] Apache Software Foundation, I. Drost, T. Dunning,
4 Related Work                                                    J. Eastman, O. Gospodnetic, G. Ingersoll, J. Man-
Research on integrating data mining algorithms with re-           nix, S. Owen, and K. Wettin. Apache Mahout,
lational DBMS includes the following. Association rules           2010. https://cwiki.apache.org/confluence/display/
mining is explored in [13]. General data mining primi-            MAHOUT/Fuzzy+K-Means.
tives are suggested in [4]. Primitives for decision trees
                                                              [8] G. Graefe, U. M. Fayyad, and S. Chaudhuri. On
mining are proposed in [8].
                                                                  the Efficient Gathering of Sufficient Statistics for
    Our research was inspired by papers [11, 12], where           Classification from Large SQL Databases. In KDD,
integrating K-Means clustering algorithm with relational          pages 204–208, 1998.
DBMS, was carried out. The way we exploit is similar
to mentioned above. The main contribution of the pa-          [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clus-
per is an extension of results presented in [11, 12] for          tering: a review. ACM Comput. Surv., 31:264–323,
the case where data vectors may belong to several clus-           September 1999.
ters. Such a case is very important in problems connected
with medicine data analysis [14, 16]. To the best of our     [10] J. B. MacQueen. Some Methods for Classifica-
knowledge there are no papers devoted to implementing             tion and Analysis of MultiVariate Observations. In
fuzzy clustering with relational DBMS.                            L. M. Le Cam and J. Neyman, editors, Proc. of the
                                                                  fifth Berkeley Symposium on Mathematical Statis-
                                                                  tics and Probability, volume 1, pages 281–297.
5 Conclusion                                                      University of California Press, 1967.
In this paper we have proposed the pgFCM algorithm.          [11] C. Ordonez. Programming the K-means cluster-
pgFCM implements Fuzzy c-Means clustering algorithm               ing algorithm in SQL. In W. Kim, R. Kohavi,
and processes data stored in relational tables using Post-        J. Gehrke, and W. DuMouchel, editors, KDD, pages
greSQL open-source DBMS. There are following issues               823–828. ACM, 2004.
to continue our research. Firstly, we plan to investigate
pgFCM scalability using both synthetical and real data       [12] C. Ordonez. Integrating K-Means Clustering with a
sets. The second direction of our research is develop-            Relational DBMS Using SQL. IEEE Trans. Knowl.
ing a parallel version of pgFCM for distribution memory           Data Eng., 18(2):188–201, 2006.
multiprocessors.
                                                             [13] S. Sarawagi, S. Thomas, and R. Agrawal. Integrat-
                                                                  ing association rule mining with relational database
     systems: alternatives and implications. In Pro-
     ceedings of the 1998 ACM SIGMOD international
     conference on Management of data, SIGMOD ’98,
     pages 343–354, New York, NY, USA, 1998. ACM.
[14] A. I. Shihab. Fuzzy Clustering Algorithms and their
     Applications to Medical Image Analysis. PhD the-
     sis, University of London, 2000.
[15] M. Stonebraker, L. A. Rowe, and M. Hirohama.
     The Implementation of POSTGRES. IEEE Trans.
     on Knowl. and Data Eng., 2:125–142, March 1990.
[16] D. Zhang and S. Chen. A Novel Kernelized Fuzzy
     c-Means Algorithm with Application in Medical
     Image Segmentation. Artificial Intelligence in
     Medicine, 32:37–50, 2004.