C-NBC: Neighborhood-Based Clustering with Constraints

Piotr Lasek
Chair of Computer Science, University of Rzeszów
ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland
lasek@ur.edu.pl

Abstract. Clustering is one of the most important methods of data mining. It is used to identify unknown yet interesting and useful patterns or trends in datasets. There are different types of clustering algorithms, such as partitioning, hierarchical, grid-based and density-based. In general, clustering methods are considered unsupervised; however, in recent years a new branch of clustering algorithms has emerged, namely constrained clustering algorithms. By means of so-called constraints, it is possible to incorporate background knowledge into clustering algorithms, which usually leads to better performance and accuracy of clustering results. Over the last years, a number of clustering algorithms employing different types of constraints have been proposed, and most of them extend existing partitioning and hierarchical approaches. Among density-based methods using constraints, algorithms such as C-DBSCAN, DBCCOM and DBCluC have been proposed. In this paper we offer a new C-NBC algorithm which combines the known neighborhood-based algorithm (NBC) and instance-level constraints.

Keywords: data mining, clustering, semi-supervised learning, constraints

1 Introduction

Clustering is one of the well-known and often used data mining methods. Its goal is to assign data objects (or points) to different clusters so that objects assigned to the same cluster are more similar to each other than to objects assigned to other clusters [1]. Clustering algorithms can operate on different types of data sources, such as databases, graphs, text and multimedia, or on any dataset containing objects that can be described by a set of features or relationships [2]. Since clustering algorithms do not require any external knowledge to be run (except parameters such as k in the k-Means algorithm), the process of clustering, in contrast to classification, is also often referred to as unsupervised learning. However, there has always been a natural need to incorporate already collected knowledge into algorithms to make them better, both in terms of efficiency and quality of results. This need has led to the construction of a new branch of clustering algorithms based on so-called constraints.

Constraint-based clustering algorithms exploit the fact that in many applications the domain knowledge, in the form of labeled objects, is already known or can easily be specified by domain experts. Moreover, in some cases such knowledge can be detected automatically. Initially, researchers focused on algorithms that incorporated pairwise constraints on cluster membership or learned distance metrics. Subsequent research was related to algorithms that used many different kinds of domain knowledge [3].

The major contribution of this work is a constrained neighborhood-based clustering algorithm called C-NBC, which combines the known NBC algorithm [5] and instance-level constraints such as must-link and cannot-link.

This paper is divided into five sections. In Section 1 we have given a brief introduction to clustering with constraints. Next, in Section 2, we briefly describe the best-known types of constraints and focus on instance-level constraints, namely must-link and cannot-link constraints. In Section 3, the Neighborhood-Based Clustering (NBC) algorithm is recalled. Further, in Section 4, we present the proposed C-NBC algorithm.
Conclusions are drawn and further research is briefly discussed in Section 5.

2 Instance-Level Constraints

In constrained clustering, background knowledge can be incorporated into algorithms by means of different types of constraints. Over the years, different methods of using constraints in clustering algorithms have been developed [3]. Constraint-based methods proposed so far employ techniques such as modifying the clustering objective function by including a penalty for not satisfying specified constraints [7], clustering using conditional distributions in an auxiliary space, enforcing all constraints to be satisfied during the clustering process [8], or determining clusters and constraints based on neighborhoods derived from already available labeled examples [9]. On the other hand, in the distance-based methods, the distance measure is designed so that it satisfies the given constraints [10, 11]. A few density-based constrained algorithms, such as C-DBSCAN [4], DBCCOM [16] or DBCluC [17], have also been proposed.

Table 1. The notations related to instance-level constraints used in this work

  Notation    Description
  ML          The set of pairs of points that are in a must-link relation.
  ml(p, q)    Two points p and q are in a must-link relation (they must be assigned to the same resulting cluster).
  ML(p)       The set of points which are in a must-link relation with point p.
  CL          The set of pairs of points that are in a cannot-link relation.
  cl(p, q)    Two points p and q are in a cannot-link relation (they must not be assigned to the same resulting cluster).
  CL(p)       The set of points which are in a cannot-link relation with point p.

Several types of constraints are known, but the hard instance-level constraints seem to be the most useful, since the incorporation of just a few constraints of this type can increase clustering accuracy as well as decrease runtime. In [6] the authors introduced two kinds of instance-level constraints: the must-link and cannot-link constraints. These constraints are simple; however, they have interesting properties. For example, must-link constraints are symmetrical, reflexive and transitive, similarly to an equivalence relation. Generally, if two points, say p and q, are in a must-link relationship (or, in other words, are connected by a must-link constraint), then these points, in the process of clustering, must be assigned to the same cluster. On the other hand, if two points, say p and q, are in a cannot-link relationship (or are connected by a cannot-link constraint), then these points must not be assigned to the same cluster.

In the next part of the paper, when referring to must-link and cannot-link constraints, we will use the notations presented in Table 1.
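To make the notation more tangible, the following sketch (ours, not part of the original paper) shows one possible way of storing the ML and CL sets from Table 1 and of materializing the transitivity of must-link constraints mentioned above; the integer point identifiers and helper names such as must_link_closure are illustrative assumptions.

    from collections import defaultdict
    from itertools import combinations

    def neighbors_map(pairs):
        """Build a map point -> set of points linked to it (ML(p) or CL(p))."""
        linked = defaultdict(set)
        for p, q in pairs:
            linked[p].add(q)   # the relations are symmetric,
            linked[q].add(p)   # so both directions are stored
        return linked

    def must_link_closure(ml_pairs):
        """Transitively close must-link: if ml(a, b) and ml(b, c), then ml(a, c).
        Implemented by merging connected components of the must-link graph."""
        linked = neighbors_map(ml_pairs)
        closure, seen = set(), set()
        for start in linked:
            if start in seen:
                continue
            component, stack = set(), [start]   # collect the component reachable from start
            while stack:
                p = stack.pop()
                if p in component:
                    continue
                component.add(p)
                stack.extend(linked[p] - component)
            seen |= component
            closure |= {tuple(sorted(pair)) for pair in combinations(component, 2)}
        return closure

    # Usage: points are identified here simply by integer ids.
    ML = {(0, 1), (1, 2)}          # must-link pairs
    CL = {(2, 5)}                  # cannot-link pairs
    print(must_link_closure(ML))   # the pairs (0, 1), (0, 2), (1, 2): 0 and 2 become must-linked
    print(neighbors_map(CL)[5])    # {2}, i.e. the set CL(5)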
3 Neighborhood-Based Clustering

The Neighborhood-Based Clustering (NBC) algorithm [5] belongs to the group of density-based clustering algorithms. The characteristic feature of NBC is the ability to measure relative local densities. Hence, it is capable of discovering clusters of different local densities and of arbitrary shape. Below we recall the key definitions related to the NBC algorithm which will be used in the sequel.

Definition 1. (ε-neighborhood, or briefly NNε(p)) The ε-neighborhood of point p (NNε(p)) is the set of all points in dataset D that are distant from p by no more than ε; that is, NNε(p) = {q ∈ D | dist(p, q) ≤ ε}, where dist is a distance function.

Definition 2. (k+-neighborhood, or k+NN(p) in brief) The k+-neighborhood of point p (k+NN(p)) is a set of k' (k' ≥ k) points satisfying the following conditions: (a) k+NN(p) = NNε(p) for some ε, (b) there is no ε' < ε such that |NNε'(p)| ≥ k.

Definition 3. (punctured k+-neighborhood, or briefly k+NN*(p)) The punctured k+-neighborhood of point p (k+NN*(p)) is equal to k+NN(p) \ {p}, where k+NN(p) is the k+-neighborhood of p.

Definition 4. (reversed punctured k+-neighborhood of a point p, or R-k+NN(p) in brief) The reversed punctured k+-neighborhood of a point p (R-k+NN(p)) is the set of all points q in dataset D such that p belongs to k+NN*(q); that is: R-k+NN(p) = {q ∈ D | p ∈ k+NN*(q)}.

Definition 5. (neighborhood-based density factor of a point) The neighborhood-based density factor of a point p (NDF(p)) is defined as NDF(p) = |R-k+NN(p)| / |k+NN*(p)|. (Points having the value of the NDF factor equal to or greater than 1 are considered dense.)

In order to find clusters, NBC starts with calculating the values of the NDF factor for each point p in a database D. Next, for each p ∈ D, the value of NDF(p) is checked. If it is greater than or equal to 1, then p is assigned to the currently created cluster (identified by the value of ClusterId). Next, the temporary variable DPSet, used for storing references to dense points, is cleared, and each point, say q, belonging to k+NN*(p) is assigned to the same cluster. If NDF(q) is greater than or equal to 1, then q is also added to DPSet. Otherwise, q is omitted and the next point from k+NN*(p) is analyzed. Further, for each point s from DPSet, k+NN*(s) is computed. All unclassified points belonging to k+NN*(s) are assigned to the current cluster, and points having values of NDF greater than or equal to 1 are added to DPSet. Next, s is removed from DPSet. When DPSet is empty, ClusterId is incremented and the next point from D is analyzed. Finally, if there are no more points in D having values of the NDF factor greater than or equal to 1, then all unclassified points in D are marked as NOISE.

4 Clustering with Constraints

In this section we offer our new neighborhood-based constrained clustering algorithm called C-NBC. The algorithm is based on the NBC algorithm [5] but uses both must-link and cannot-link constraints for incorporating knowledge into the algorithm. The C-NBC algorithm employs the same definitions as NBC, which are used in the process of clustering to assign points to appropriate clusters or to mark them as noise. In NBC three kinds of points can be distinguished: unclassified, classified and noise points. In our proposal we introduce another kind of point, namely deferred points (Definition 6).

Definition 6. (deferred point) A point is called deferred if it is involved in a cannot-link relationship with any other point or if it belongs to a punctured k+-neighborhood k+NN*(q), where q is any point involved in a cannot-link relationship.

C-NBC (Figure 1) can be divided into two main steps. In the first step the algorithm works very similarly to NBC. It calculates NDF factors and performs clustering. The difference between C-NBC and NBC is that the former additionally determines deferred points and has to deal with must-link constraints when building clusters. In spite of the fact that in the first step the deferred points are not assigned to any cluster, these points are normally used to calculate NDF factors. In the second step, C-NBC allocates the deferred points to appropriate clusters. This is done by means of the AssignDeferredPointsToClusters function (Figure 2), which assigns points to clusters after checking whether such an assignment is feasible. The detailed description of C-NBC is provided below. The key variables and notations related to C-NBC are explained in Table 2.

Table 2. The auxiliary variables and notations used in the pseudo-code of C-NBC in Figure 1

  Variable or notation   Description
  ClusterId              The auxiliary integer variable used for storing the currently created cluster's identifier.
  p.ClusterId            By using such a notation we refer to the ClusterId related to point p.
  p.ndf                  Such a notation is used to refer to the value of the NDF factor associated with point p.
  Dd, Td                 The auxiliary variables for storing deferred points.
  DPSet                  The variable for storing dense points. It is used in the iterative process of assigning points to clusters.
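Before presenting the pseudo-code of C-NBC in Figure 1, the following sketch illustrates how the CalcNDF step and the determination of deferred points (Definition 6) could be realized. It is our own illustrative code, not the paper's implementation: it assumes a brute-force neighbor search instead of the R-Tree index used in the experiments, and names such as calc_ndf and deferred_points are assumptions.

    import numpy as np

    def punctured_k_neighborhood(D, i, k):
        """k+NN*(p): indices of all points (excluding p itself) not farther from p
        than its k-th nearest neighbor (ties are included)."""
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                        # puncture: exclude the point itself
        kth = np.partition(dists, k - 1)[k - 1]  # distance to the k-th nearest neighbor
        return set(np.where(dists <= kth)[0])

    def calc_ndf(D, k):
        """NDF(p) = |R-k+NN(p)| / |k+NN*(p)| for every point of D."""
        neigh = [punctured_k_neighborhood(D, i, k) for i in range(len(D))]
        reverse_counts = np.zeros(len(D))
        for i, ns in enumerate(neigh):
            for j in ns:
                reverse_counts[j] += 1           # point j occurs in k+NN*(point i)
        return [reverse_counts[i] / len(neigh[i]) for i in range(len(D))], neigh

    def deferred_points(neigh, cannot_link):
        """Definition 6: points involved in a cannot-link constraint and the members
        of their punctured k+-neighborhoods are deferred."""
        deferred = set()
        for p, q in cannot_link:
            for x in (p, q):
                deferred.add(x)
                deferred |= neigh[x]
        return deferred

    # Usage on a toy dataset
    D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
    ndf, neigh = calc_ndf(D, k=2)
    print(ndf)                                   # values >= 1 indicate dense points
    print(deferred_points(neigh, {(0, 3)}))      # points 0, 3 and their neighborhoods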
Algorithm C-NBC(D, k, ML, CL)
Input: D – the input, not yet clustered, dataset; k – the parameter of the C-NBC algorithm;
       ML, CL – the sets of pairs of points connected by must-link or cannot-link constraints
Output: the clustered set D with clusters satisfying the cannot-link and must-link constraints

  Dd = ∅;                                        // initialization of the set of deferred points
  label all points in D as UNCLASSIFIED;         // p.ClusterId = UNCLASSIFIED
  CalcNDF(D, k);                                 // calculate values of the NDF factor for every point in D
  for each point p involved in any constraint from CL do
    label p and the points in k+NN*(p) as DEFERRED   // label points as DEFERRED
    add p and the points in k+NN*(p) to Dd;          // add DEFERRED points to Dd
  endfor
  ClusterId = 0;
  for each unclassified point p in D such that p.ndf ≥ 1 do
    p.ClusterId = ClusterId;
    clear DPSet;
    for each point q in k+NN*(p) do
      q.ClusterId = ClusterId;                   // set point's cluster identifier
      if (q.ndf ≥ 1) then add q to DPSet; endif
      add all points from ML(q) whose ndf ≥ 1 to DPSet;
    endfor
    while (DPSet ≠ ∅) do                         // expanding the currently created cluster
      s = first point in DPSet;
      for each unclassified point q in k+NN*(s) do
        q.ClusterId = ClusterId;                 // set point's cluster identifier
        if (q.ndf ≥ 1) then add q to DPSet; endif
        add all points from ML(q) whose ndf ≥ 1 to DPSet;
      endfor
      remove s from DPSet;
    endwhile
    ClusterId++;
  endfor
  label unclassified points in D as NOISE;
  AssignDeferredPointsToClusters(D, Dd, k, CL);

Fig. 1. The pseudo-code of the C-NBC algorithm. The most important differences between NBC and C-NBC are marked with gray color

The C-NBC algorithm starts with the CalcNDF function. After calculating the NDF factors for each point from D, the deferred points are determined by scanning the pairs of cannot-link connected points. Such points are added to an auxiliary set Dd. Then the clustering process is performed in the following way: for each point p which was not marked as DEFERRED, it is checked whether p.ndf is less than 1. If p.ndf < 1, then p is omitted and the next point from D is checked. If p.ndf ≥ 1, then p, as a dense point, is assigned to the currently created cluster identified by the current value of ClusterId. Next, the temporary variable DPSet is cleared and each not deferred point, say q, belonging to k+NN*(p) is assigned to the currently created cluster identified by the current value of the ClusterId variable. Additionally, if q.ndf ≥ 1, then q is added to DPSet, as well as all dense points which are in a must-link relation with q.

Function AssignDeferredPointsToClusters(D, Dd, k, CL)
Input: D – the input, not yet clustered, dataset; Dd – the set of points marked as deferred;
       k – the parameter of the C-NBC algorithm; CL – the set of cannot-link constraints
Output: the clustered set D with clusters satisfying the cannot-link and must-link constraints

  Td = Dd;                                       // saving the contents of Dd in a temporary set
  do begin
    Ad = ∅;                                      // a temporary set for storing deferred points assigned to any cluster
    foreach point p in Td do begin
      foreach point q in k+NN*(p) do
        if (q.ndf ≥ 1 and q.ClusterId > 0 and q ∉ CL(p)) then
          if (CanBeAssigned(p, q.ClusterId) and                          // checking if p can be assigned to
                                                                         // the cluster identified by q.ClusterId
              CanBeAssigned(p.nearestCannotLinkPoint, q.ClusterId)) then
            p.ClusterId = q.ClusterId; add p to Ad;
            break;
          endif
        endif
      endfor
    endfor
    remove the points of Ad from Td;
  while (Ad ≠ ∅)

Fig. 2. The pseudo-code of the AssignDeferredPointsToClusters function

Next, for each point s from DPSet, its punctured k+-neighborhood k+NN*(s) is determined. Each point, say t, which belongs to this neighborhood and has not been labeled as deferred yet is assigned to the currently created cluster, and if its value of NDF is equal to or greater than 1, it is added to DPSet. Moreover, all dense points which are in a must-link relation with t are added to DPSet as well. Later, s is removed from DPSet and the next point from DPSet is processed. When DPSet is emptied, ClusterId is incremented. After all points from D are processed, unclassified points are marked as noise by setting the values of their ClusterId attribute to NOISE. However, in order to process the deferred points, the AssignDeferredPointsToClusters function is invoked. The function works so that, for each deferred point p, it finds the nearest point q assigned to any cluster and checks whether it is possible (in accordance with the cannot-link constraints) to assign p to the same cluster as q. Additionally, the function checks whether the assignment of p to a specific cluster will not violate previous assignments of deferred points.
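The paper's pseudo-code does not spell out the CanBeAssigned function. The sketch below is our own, hedged reading of the second step: a point may be assigned to a cluster only if no point already assigned to that cluster is in a cannot-link relation with it, and passes over the deferred points are repeated as long as progress is made. The additional re-check of p.nearestCannotLinkPoint from Figure 2 is omitted here; all names (can_be_assigned, assign_deferred_points, cluster_of, cannot_link) are assumptions.

    def can_be_assigned(p, c, cluster_of, cannot_link):
        """True if assigning point p to cluster c violates no cannot-link constraint,
        i.e. no point already in cluster c belongs to CL(p)."""
        return all(cluster_of.get(q) != c for q in cannot_link.get(p, set()))

    def assign_deferred_points(deferred, neigh, ndf, cluster_of, cannot_link):
        """Iteratively assign deferred points to the clusters of their dense, already
        clustered neighbors, as long as at least one point was assigned in a pass."""
        remaining = set(deferred)
        while True:
            assigned = set()
            for p in remaining:
                for q in neigh[p]:                       # q in k+NN*(p)
                    if ndf[q] >= 1 and q in cluster_of and \
                            can_be_assigned(p, cluster_of[q], cluster_of, cannot_link):
                        cluster_of[p] = cluster_of[q]
                        assigned.add(p)
                        break
            remaining -= assigned
            if not assigned:                             # no progress: remaining points stay unassigned
                break
        return cluster_of

    # Usage: point 3 joins cluster 0 via its dense neighbor 2; point 5 stays
    # unassigned because its only clustered neighbor 4 sits in the cluster
    # its cannot-link partner belongs to.
    clusters = {1: 0, 2: 0, 4: 1}                        # points already clustered in step one
    print(assign_deferred_points({3, 5}, {3: {2, 4}, 5: {4}},
                                 {1: 1.0, 2: 1.2, 3: 0.8, 4: 1.5, 5: 1.1},
                                 clusters, {5: {4}}))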
In order to test the efficiency of the proposed C-NBC algorithm we performed a number of basic experiments using the following datasets: a manually created dataset containing 2800 data points, the well-known four-dimensional Iris dataset [13] and the Birch dataset [14]. The main goal of the experiments was to examine the influence of the number of constraints used in the process of clustering on the efficiency of the proposed algorithm. In Table 3 we compare the run-times of the NBC and C-NBC algorithms. Both implementations of the algorithms employ the same index structure, namely the R-Tree [15]. When performing experiments using C-NBC we varied the number of must-link and cannot-link constraints from 2 to 250, except for the Iris dataset, which was not numerous enough to use more than 50 constraints of both types. Since additional operations must be performed, we expected that the number of constraints would have a negative impact on the efficiency of the algorithm. However, we did not encounter any critical increase (at most about 2 times for the largest dataset and 250 must-link and 250 cannot-link constraints) in the run-time of clustering using the C-NBC algorithm. For this reason, similarly to NBC, C-NBC can be considered effective, efficient and easy to use.

Table 3. The results of the experiments. Run-times are given in milliseconds. The value of k (the parameter of the C-NBC algorithm) is equal to 15. The number of constraints concerns both must-link and cannot-link constraints. |D| – number of points in a dataset, d – number of dimensions, * – the number of constraints was greater than the cardinality of the dataset

                                            Run-times
  Algorithm   Number of     Manually created      Iris                Birch
              constraints   |D| = 2800, d = 2     |D| = 250, d = 4    |D| = 10000, d = 2
  NBC         0             3284                  67                  13385
  C-NBC       2             3500                  101                 14132
              10            4172                  105                 14969
              50            6067                  *                   21360
              250           6457                  *                   29931

5 Conclusions and further research

In recent years a number of algorithms employing different kinds of constraints have been proposed. The most commonly used constraints are instance-level constraints, which are widely employed to enrich clustering algorithms with additional background knowledge. However, among density-based clustering algorithms, so far only the DBSCAN algorithm [12] has been adapted to use instance-level constraints.

In this paper we have offered C-NBC, a density-based algorithm for clustering under instance-level constraints which is based on the known NBC algorithm. Our preliminary experiments confirm that the algorithm performs in line with the assumptions. In other words, it works so that must-link connected points are assigned to the same clusters and cannot-link connected points are assigned to different clusters. Moreover, for the datasets tested, it turns out that the constraints have little effect on the efficiency of the algorithm.
Nevertheless, in our future research we would like to continue to focus on the performance of constraint-based clustering and on proposing other methods of enforcing constraints during the clustering process for larger and high-dimensional datasets.

References

1. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
2. S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications, 1st ed. Chapman & Hall/CRC, 2008.
3. I. Davidson and S. Basu, "A Survey of Clustering with Instance Level Constraints," ACM Trans. Knowl. Discov. Data, pp. 1–41, 2007.
4. C. Ruiz, M. Spiliopoulou, and E. Menasalvas, "C-DBSCAN: Density-Based Clustering with Constraints," in Proceedings of the International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), LNCS vol. 4481, 2007, pp. 216–223.
5. S. Zhou, Y. Zhao, J. Guan, and J. Huang, "A Neighborhood-Based Clustering Algorithm," in Advances in Knowledge Discovery and Data Mining (PAKDD 2005), LNCS vol. 3518, T. Ho, D. Cheung, and H. Liu, Eds. Springer Berlin Heidelberg, 2005, pp. 361–371.
6. K. Wagstaff and C. Cardie, "Clustering with Instance-level Constraints," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 1103–1110.
7. I. Davidson and S. Ravi, "Clustering With Constraints: Feasibility Issues and the k-Means Algorithm," in Proceedings of the 2005 SIAM International Conference on Data Mining, 2005, pp. 138–149.
8. K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained K-means Clustering with Background Knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577–584.
9. S. Basu, A. Banerjee, and R. J. Mooney, "Semi-supervised Clustering by Seeding," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 27–34.
10. T. Hertz, A. Bar-Hillel, and D. Weinshall, "Boosting Margin Based Distance Functions for Clustering," in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, pp. 393–400.
11. H. Chang and D.-Y. Yeung, "Locally Linear Metric Adaptation for Semi-supervised Clustering," in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 153–160.
12. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
13. C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases," University of California, Irvine, http://archive.ics.uci.edu/ml/, 1998.
14. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications," Data Min. Knowl. Discov., vol. 1, pp. 103–144, 1997.
15. A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '84), 1984, pp. 47–53.
16. N. Duhan and A. K. Sharma, "DBCCOM: Density Based Clustering with Constraints and Obstacle Modeling," in Contemporary Computing, vol. 168, Springer Berlin Heidelberg, 2011, pp. 212–228.
17. O. R. Zaïane and C.-H. Lee, "Clustering Spatial Data When Facing Physical Constraints," in Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 2002.