C-NBC: Neighborhood-Based Clustering with Constraints

Piotr Lasek
Chair of Computer Science, University of Rzeszów
ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland
lasek@ur.edu.pl

Abstract. Clustering is one of the most important methods of data mining. It is used to identify unknown yet interesting and useful patterns or trends in datasets. There are different types of clustering algorithms, such as partitioning, hierarchical, grid-based and density-based. In general, clustering methods are considered unsupervised; however, in recent years a new branch of clustering algorithms has emerged, namely constrained clustering algorithms. By means of so-called constraints, it is possible to incorporate background knowledge into clustering algorithms, which usually leads to better performance and accuracy of clustering results. Over the last years, a number of clustering algorithms employing different types of constraints have been proposed, and most of them extend existing partitioning and hierarchical approaches. Among density-based methods using constraints, algorithms such as C-DBSCAN, DBCCOM and DBCluC have been proposed. In this paper we offer a new C-NBC algorithm which combines the known neighborhood-based algorithm (NBC) and instance-level constraints.

Keywords: data mining, clustering, semi-supervised learning, constraints

1 Introduction

Clustering is one of the well-known and often used data mining methods. Its goal is to assign data objects (or points) to different clusters so that objects assigned to the same cluster are more similar to each other than to objects assigned to other clusters [1]. Clustering algorithms can operate on different types of data sources, such as databases, graphs, text and multimedia, or on any dataset containing objects that can be described by a set of features or relationships [2]. Since clustering algorithms do not require any external knowledge to be run (except parameters such as k in the k-Means algorithm), the process of clustering, in contrast to classification, is also often referred to as unsupervised learning. However, there has always been a natural need to incorporate already collected knowledge into algorithms to make them better, both in terms of efficiency and quality of results. This need has led to the construction of a new branch of clustering algorithms based on so-called constraints.

Constraint-based clustering algorithms exploit the fact that in many applications the domain knowledge, in the form of labeled objects, is already known or can easily be specified by domain experts. Moreover, in some cases such knowledge can be detected automatically. Initially, researchers focused on algorithms that incorporated pairwise constraints on cluster membership or learned distance metrics. Subsequent research was related to algorithms that used many different kinds of domain knowledge [3].

The major contribution of this work is a constrained neighborhood-based clustering algorithm called C-NBC, which combines the known NBC algorithm [5] and instance-level constraints such as must-link and cannot-link.

This paper is divided into five sections. In Section 1 we have given a brief introduction to clustering with constraints. Next, in Section 2, we briefly describe the best-known types of constraints and focus on instance-level constraints, namely must-link and cannot-link constraints. In Section 3, the Neighborhood-Based Clustering (NBC) algorithm is recalled. Further, in Section 4, we present the proposed C-NBC algorithm.
Conclusions are drawn and further research is briefly discussed in Section 5.

2 Instance-Level Constraints

In constrained clustering, background knowledge can be incorporated into algorithms by means of different types of constraints. Over the years, different methods of using constraints in clustering algorithms have been developed [3]. Constraint-based methods proposed so far employ techniques such as modifying the clustering objective function by including a penalty for not satisfying specified constraints [7], clustering using conditional distributions in an auxiliary space, enforcing all constraints to be satisfied during the clustering process [8], or determining clusters and constraints based on neighborhoods derived from already available labeled examples [9]. On the other hand, in the distance-based methods, the distance measure is designed so that it satisfies the given constraints [10, 11]. A few density-based constrained algorithms, such as C-DBSCAN [4], DBCCOM [16] or DBCluC [17], have also been proposed.

Table 1. The notations related to instance-level constraints used in this work

  Notation    Description
  ML          The set of pairs of points that are in a must-link relation.
  ml(p, q)    Two points p and q are in a must-link relation (they must be assigned to the same resulting cluster).
  ML(p)       The set of points which are in a must-link relation with point p.
  CL          The set of pairs of points that are in a cannot-link relation.
  cl(p, q)    Two points p and q are in a cannot-link relation (they must not be assigned to the same resulting cluster).
  CL(p)       The set of points which are in a cannot-link relation with point p.

Several types of constraints are known, but the hard instance-level constraints seem to be the most useful, since the incorporation of just a few constraints of this type can increase clustering accuracy as well as decrease runtime. In [6] the authors introduced two kinds of instance-level constraints: the must-link and cannot-link constraints. These constraints are simple; however, they have interesting properties. For example, must-link constraints are symmetrical, reflexive and transitive, similarly to an equivalence relation. Generally, if two points, say p and q, are in a must-link relationship (or, in other words, are connected by a must-link constraint), then these points, in the process of clustering, must be assigned to the same cluster. On the other hand, if two points, say p and q, are in a cannot-link relationship (or are connected by a cannot-link constraint), then these points must not be assigned to the same cluster.

In the next part of the paper, when referring to must-link and cannot-link constraints, we will use the notations presented in Table 1.
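To make the notation more tangible, the following sketch (ours, not part of the original paper) shows one possible way of storing the ML and CL sets from Table 1 and of materializing the transitivity of must-link constraints mentioned above; the integer point identifiers and helper names such as must_link_closure are illustrative assumptions.

    from collections import defaultdict
    from itertools import combinations

    def neighbors_map(pairs):
        """Build a map point -> set of points linked to it (ML(p) or CL(p))."""
        linked = defaultdict(set)
        for p, q in pairs:
            linked[p].add(q)   # the relations are symmetric,
            linked[q].add(p)   # so both directions are stored
        return linked

    def must_link_closure(ml_pairs):
        """Transitively close must-link: if ml(a, b) and ml(b, c), then ml(a, c).
        Implemented by merging connected components of the must-link graph."""
        linked = neighbors_map(ml_pairs)
        closure, seen = set(), set()
        for start in linked:
            if start in seen:
                continue
            component, stack = set(), [start]   # collect the component reachable from start
            while stack:
                p = stack.pop()
                if p in component:
                    continue
                component.add(p)
                stack.extend(linked[p] - component)
            seen |= component
            closure |= {tuple(sorted(pair)) for pair in combinations(component, 2)}
        return closure

    # Usage: points are identified here simply by integer ids.
    ML = {(0, 1), (1, 2)}          # must-link pairs
    CL = {(2, 5)}                  # cannot-link pairs
    print(must_link_closure(ML))   # the pairs (0, 1), (0, 2), (1, 2): 0 and 2 become must-linked
    print(neighbors_map(CL)[5])    # {2}, i.e. the set CL(5)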
3 Neighborhood-Based Clustering

The Neighborhood-Based Clustering (NBC) algorithm [5] belongs to the group of density-based clustering algorithms. The characteristic feature of NBC is the ability to measure relative local densities. Hence, it is capable of discovering clusters of different local densities and of arbitrary shape. Below we recall the key definitions related to the NBC algorithm which will be used in the sequel.

Definition 1. (ε-neighborhood, or briefly NNε(p)) The ε-neighborhood of point p (NNε(p)) is the set of all points in dataset D that are distant from p by no more than ε; that is, NNε(p) = {q ∈ D | dist(p, q) ≤ ε}, where dist is a distance function.

Definition 2. (k+-neighborhood, or k+NN(p) in brief) The k+-neighborhood of point p (k+NN(p)) is a set of k' (k' ≥ k) points satisfying the following conditions: (a) k+NN(p) = NNε(p) for some ε, (b) there is no ε' < ε such that |NNε'(p)| ≥ k.

Definition 3. (punctured k+-neighborhood, or briefly k+NN*(p)) The punctured k+-neighborhood of point p (k+NN*(p)) is equal to k+NN(p) \ {p}, where k+NN(p) is the k+-neighborhood of p.

Definition 4. (reversed punctured k+-neighborhood of a point p, or R-k+NN(p) in brief) The reversed punctured k+-neighborhood of a point p (R-k+NN(p)) is the set of all points q in dataset D such that p belongs to k+NN*(q); that is: R-k+NN(p) = {q ∈ D | p ∈ k+NN*(q)}.

Definition 5. (neighborhood-based density factor of a point) The neighborhood-based density factor of a point p (NDF(p)) is defined as NDF(p) = |R-k+NN(p)| / |k+NN*(p)|. (Points having the value of the NDF factor equal to or greater than 1 are considered dense.)

In order to find clusters, NBC starts with calculating the values of the NDF factor for each point p in a database D. Next, for each p ∈ D, the value of NDF(p) is checked. If it is greater than or equal to 1, then p is assigned to the currently created cluster (identified by the value of ClusterId). Next, the temporary variable DPSet, used for storing references to dense points, is cleared, and each point, say q, belonging to k+NN*(p) is assigned to the same cluster. If NDF(q) is greater than or equal to 1, then q is also added to DPSet. Otherwise, q is omitted and the next point from k+NN*(p) is analyzed. Further, for each point s from DPSet, k+NN*(s) is computed. All unclassified points belonging to k+NN*(s) are assigned to the current cluster, and points having values of NDF greater than or equal to 1 are added to DPSet. Next, s is removed from DPSet. When DPSet is empty, ClusterId is incremented and the next point from D is analyzed. Finally, if there are no more points in D having values of the NDF factor greater than or equal to 1, then all unclassified points in D are marked as NOISE.

4 Clustering with Constraints

In this section we offer our new neighborhood-based constrained clustering algorithm called C-NBC. The algorithm is based on the NBC algorithm [5] but uses both must-link and cannot-link constraints for incorporating knowledge into the algorithm. The C-NBC algorithm employs the same definitions as NBC, which are used in the process of clustering to assign points to appropriate clusters or to mark them as noise. In NBC three kinds of points can be distinguished: unclassified, classified and noise points. In our proposal we introduce another kind of point, namely deferred points (Definition 6).

Definition 6. (deferred point) A point is called deferred if it is involved in a cannot-link relationship with any other point or if it belongs to a punctured k+-neighborhood k+NN*(q), where q is any point involved in a cannot-link relationship.

C-NBC (Figure 1) can be divided into two main steps. In the first step the algorithm works very similarly to NBC. It calculates NDF factors and performs clustering. The difference between C-NBC and NBC is that the former additionally determines deferred points and has to deal with must-link constraints when building clusters. In spite of the fact that in the first step the deferred points are not assigned to any cluster, these points are normally used to calculate NDF factors. In the second step, C-NBC allocates the deferred points to appropriate clusters. This is done by means of the AssignDeferredPointsToClusters function (Figure 2), which assigns points to clusters after checking whether such an assignment is feasible. The detailed description of C-NBC is provided below. The key variables and notations related to C-NBC are explained in Table 2.

Table 2. The auxiliary variables and notations used in the pseudo-code of C-NBC in Figure 1

  Variable or notation   Description
  ClusterId              The auxiliary integer variable used for storing the currently created cluster's identifier.
  p.ClusterId            By using such a notation we refer to the ClusterId related to point p.
  p.ndf                  Such a notation is used to refer to the value of the NDF factor associated with point p.
  Dd, Td                 The auxiliary variables for storing deferred points.
  DPSet                  The variable for storing dense points. It is used in the iterative process of assigning points to clusters.
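Before presenting the pseudo-code of C-NBC in Figure 1, the following sketch illustrates how the CalcNDF step and the determination of deferred points (Definition 6) could be realized. It is our own illustrative code, not the paper's implementation: it assumes a brute-force neighbor search instead of the R-Tree index used in the experiments, and names such as calc_ndf and deferred_points are assumptions.

    import numpy as np

    def punctured_k_neighborhood(D, i, k):
        """k+NN*(p): indices of all points (excluding p itself) not farther from p
        than its k-th nearest neighbor (ties are included)."""
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                        # puncture: exclude the point itself
        kth = np.partition(dists, k - 1)[k - 1]  # distance to the k-th nearest neighbor
        return set(np.where(dists <= kth)[0])

    def calc_ndf(D, k):
        """NDF(p) = |R-k+NN(p)| / |k+NN*(p)| for every point of D."""
        neigh = [punctured_k_neighborhood(D, i, k) for i in range(len(D))]
        reverse_counts = np.zeros(len(D))
        for i, ns in enumerate(neigh):
            for j in ns:
                reverse_counts[j] += 1           # point j occurs in k+NN*(point i)
        return [reverse_counts[i] / len(neigh[i]) for i in range(len(D))], neigh

    def deferred_points(neigh, cannot_link):
        """Definition 6: points involved in a cannot-link constraint and the members
        of their punctured k+-neighborhoods are deferred."""
        deferred = set()
        for p, q in cannot_link:
            for x in (p, q):
                deferred.add(x)
                deferred |= neigh[x]
        return deferred

    # Usage on a toy dataset
    D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
    ndf, neigh = calc_ndf(D, k=2)
    print(ndf)                                   # values >= 1 indicate dense points
    print(deferred_points(neigh, {(0, 3)}))      # points 0, 3 and their neighborhoods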
Algorithm C-NBC(D, k, ML, CL)
Input: D – the input, not yet clustered, dataset; k – the parameter of the C-NBC algorithm;
       ML, CL – the sets of pairs of points connected by must-link or cannot-link constraints
Output: the clustered set D with clusters satisfying the cannot-link and must-link constraints

  Dd = ∅;                                        // initialization of the set of deferred points
  label all points in D as UNCLASSIFIED;         // p.ClusterId = UNCLASSIFIED
  CalcNDF(D, k);                                 // calculate values of the NDF factor for every point in D
  for each point p involved in any constraint from CL do
    label p and the points in k+NN*(p) as DEFERRED   // label points as DEFERRED
    add p and the points in k+NN*(p) to Dd;          // add DEFERRED points to Dd
  endfor
  ClusterId = 0;
  for each unclassified point p in D such that p.ndf ≥ 1 do
    p.ClusterId = ClusterId;
    clear DPSet;
    for each point q in k+NN*(p) do
      q.ClusterId = ClusterId;                   // set point's cluster identifier
      if (q.ndf ≥ 1) then add q to DPSet; endif
      add all points from ML(q) whose ndf ≥ 1 to DPSet;
    endfor
    while (DPSet ≠ ∅) do                         // expanding the currently created cluster
      s = first point in DPSet;
      for each unclassified point q in k+NN*(s) do
        q.ClusterId = ClusterId;                 // set point's cluster identifier
        if (q.ndf ≥ 1) then add q to DPSet; endif
        add all points from ML(q) whose ndf ≥ 1 to DPSet;
      endfor
      remove s from DPSet;
    endwhile
    ClusterId++;
  endfor
  label unclassified points in D as NOISE;
  AssignDeferredPointsToClusters(D, Dd, k, CL);

Fig. 1. The pseudo-code of the C-NBC algorithm. The most important differences between NBC and C-NBC are marked with gray color

The C-NBC algorithm starts with the CalcNDF function. After calculating the NDF factors for each point from D, the deferred points are determined by scanning the pairs of cannot-link connected points. Such points are added to an auxiliary set Dd. Then the clustering process is performed in the following way: for each point p which was not marked as DEFERRED, it is checked whether p.ndf is less than 1. If p.ndf < 1, then p is omitted and the next point from D is checked. If p.ndf ≥ 1, then p, as a dense point, is assigned to the currently created cluster identified by the current value of ClusterId. Next, the temporary variable DPSet is cleared and each not deferred point, say q, belonging to k+NN*(p) is assigned to the currently created cluster identified by the current value of the ClusterId variable. Additionally, if q.ndf ≥ 1, then q is added to DPSet, as well as all dense points which are in a must-link relation with q.

Function AssignDeferredPointsToClusters(D, Dd, k, CL)
Input: D – the input, not yet clustered, dataset; Dd – the set of points marked as deferred;
       k – the parameter of the C-NBC algorithm; CL – the set of cannot-link constraints
Output: the clustered set D with clusters satisfying the cannot-link and must-link constraints

  Td = Dd;                                       // saving the contents of Dd in a temporary set
  do begin
    Ad = ∅;                                      // a temporary set for storing deferred points assigned to any cluster
    foreach point p in Td do begin
      foreach point q in k+NN*(p) do
        if (q.ndf ≥ 1 and q.ClusterId > 0 and q ∉ CL(p)) then
          if (CanBeAssigned(p, q.ClusterId) and                          // checking if p can be assigned to
                                                                         // the cluster identified by q.ClusterId
              CanBeAssigned(p.nearestCannotLinkPoint, q.ClusterId)) then
            p.ClusterId = q.ClusterId; add p to Ad;
            break;
          endif
        endif
      endfor
    endfor
    remove the points of Ad from Td;
  while (Ad ≠ ∅)

Fig. 2. The pseudo-code of the AssignDeferredPointsToClusters function

Next, for each point s from DPSet, its punctured k+-neighborhood k+NN*(s) is determined. Each point, say t, which belongs to this neighborhood and has not been labeled as deferred yet is assigned to the currently created cluster, and if its value of NDF is equal to or greater than 1, it is added to DPSet. Moreover, all dense points which are in a must-link relation with t are added to DPSet as well. Later, s is removed from DPSet and the next point from DPSet is processed. When DPSet is emptied, ClusterId is incremented. After all points from D are processed, unclassified points are marked as noise by setting the values of their ClusterId attribute to NOISE. However, in order to process the deferred points, the AssignDeferredPointsToClusters function is invoked. The function works so that, for each deferred point p, it finds the nearest point q assigned to any cluster and checks whether it is possible (in accordance with the cannot-link constraints) to assign p to the same cluster as q. Additionally, the function checks whether the assignment of p to a specific cluster will not violate previous assignments of deferred points.
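The paper's pseudo-code does not spell out the CanBeAssigned function. The sketch below is our own, hedged reading of the second step: a point may be assigned to a cluster only if no point already assigned to that cluster is in a cannot-link relation with it, and passes over the deferred points are repeated as long as progress is made. The additional re-check of p.nearestCannotLinkPoint from Figure 2 is omitted here; all names (can_be_assigned, assign_deferred_points, cluster_of, cannot_link) are assumptions.

    def can_be_assigned(p, c, cluster_of, cannot_link):
        """True if assigning point p to cluster c violates no cannot-link constraint,
        i.e. no point already in cluster c belongs to CL(p)."""
        return all(cluster_of.get(q) != c for q in cannot_link.get(p, set()))

    def assign_deferred_points(deferred, neigh, ndf, cluster_of, cannot_link):
        """Iteratively assign deferred points to the clusters of their dense, already
        clustered neighbors, as long as at least one point was assigned in a pass."""
        remaining = set(deferred)
        while True:
            assigned = set()
            for p in remaining:
                for q in neigh[p]:                       # q in k+NN*(p)
                    if ndf[q] >= 1 and q in cluster_of and \
                            can_be_assigned(p, cluster_of[q], cluster_of, cannot_link):
                        cluster_of[p] = cluster_of[q]
                        assigned.add(p)
                        break
            remaining -= assigned
            if not assigned:                             # no progress: remaining points stay unassigned
                break
        return cluster_of

    # Usage: point 3 joins cluster 0 via its dense neighbor 2; point 5 stays
    # unassigned because its only clustered neighbor 4 sits in the cluster
    # its cannot-link partner belongs to.
    clusters = {1: 0, 2: 0, 4: 1}                        # points already clustered in step one
    print(assign_deferred_points({3, 5}, {3: {2, 4}, 5: {4}},
                                 {1: 1.0, 2: 1.2, 3: 0.8, 4: 1.5, 5: 1.1},
                                 clusters, {5: {4}}))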
In order to test the efficiency of the proposed C-NBC algorithm we performed a number of basic experiments using the following datasets: a manually created dataset containing 2800 data points, the well-known four-dimensional Iris dataset [13] and the Birch dataset [14]. The main goal of the experiments was to examine the influence of the number of constraints used in the process of clustering on the efficiency of the proposed algorithm. In Table 3 we compare the run-times of the NBC and C-NBC algorithms. Both implementations of the algorithms employ the same index structure, namely the R-Tree [15]. When performing experiments using C-NBC we varied the number of must-link and cannot-link constraints from 2 to 250, except for the Iris dataset, which was not numerous enough to use more than 50 constraints of both types. Since additional operations must be performed, we expected that the number of constraints would have a negative impact on the efficiency of the algorithm. However, we did not encounter any critical increase (at most about 2 times for the largest dataset and 250 must-link and 250 cannot-link constraints) in the run-time of clustering using the C-NBC algorithm. For this reason, similarly to NBC, C-NBC can be considered effective, efficient and easy to use.

Table 3. The results of the experiments. Run-times are given in milliseconds. The value of k (the parameter of the C-NBC algorithm) is equal to 15. The number of constraints concerns both must-link and cannot-link constraints. |D| – number of points in a dataset, d – number of dimensions, * – the number of constraints was greater than the cardinality of the dataset

                                            Run-times
  Algorithm   Number of     Manually created      Iris                Birch
              constraints   |D| = 2800, d = 2     |D| = 250, d = 4    |D| = 10000, d = 2
  NBC         0             3284                  67                  13385
  C-NBC       2             3500                  101                 14132
              10            4172                  105                 14969
              50            6067                  *                   21360
              250           6457                  *                   29931

5 Conclusions and further research

In recent years a number of algorithms employing different kinds of constraints have been proposed. The most commonly used constraints are instance-level constraints, which are widely employed to enrich clustering algorithms with additional background knowledge. However, among density-based clustering algorithms, so far only the DBSCAN algorithm [12] has been adapted to use instance-level constraints.

In this paper we have offered C-NBC, a density-based algorithm for clustering under instance-level constraints which is based on the known NBC algorithm. Our preliminary experiments confirm that the algorithm performs in line with the assumptions. In other words, it works so that must-link connected points are assigned to the same clusters and cannot-link connected points are assigned to different clusters. Moreover, for the datasets tested, it turns out that the constraints have little effect on the efficiency of the algorithm.
Nevertheless, in our future research we would like to continue to focus on the performance of constraint-based clustering and on proposing other methods of enforcing constraints during the clustering process for larger and high-dimensional datasets.

References

1. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
2. S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications, 1st ed. Chapman & Hall/CRC, 2008.
3. I. Davidson and S. Basu, "A Survey of Clustering with Instance Level Constraints," ACM Trans. Knowl. Discov. Data, pp. 1–41, 2007.
4. C. Ruiz, M. Spiliopoulou, and E. Menasalvas, "C-DBSCAN: Density-Based Clustering with Constraints," in Proceedings of the International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), LNCS vol. 4481, 2007, pp. 216–223.
5. S. Zhou, Y. Zhao, J. Guan, and J. Huang, "A Neighborhood-Based Clustering Algorithm," in Advances in Knowledge Discovery and Data Mining (PAKDD 2005), LNCS vol. 3518, T. Ho, D. Cheung, and H. Liu, Eds. Springer Berlin Heidelberg, 2005, pp. 361–371.
6. K. Wagstaff and C. Cardie, "Clustering with Instance-level Constraints," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 1103–1110.
7. I. Davidson and S. Ravi, "Clustering With Constraints: Feasibility Issues and the k-Means Algorithm," in Proceedings of the 2005 SIAM International Conference on Data Mining, 2005, pp. 138–149.
8. K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, "Constrained K-means Clustering with Background Knowledge," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577–584.
9. S. Basu, A. Banerjee, and R. J. Mooney, "Semi-supervised Clustering by Seeding," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 27–34.
10. T. Hertz, A. Bar-Hillel, and D. Weinshall, "Boosting Margin Based Distance Functions for Clustering," in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, pp. 393–400.
11. H. Chang and D.-Y. Yeung, "Locally Linear Metric Adaptation for Semi-supervised Clustering," in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 153–160.
12. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
13. C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases," University of California, Irvine, http://archive.ics.uci.edu/ml/, 1998.
14. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications," Data Min. Knowl. Discov., vol. 1, pp. 103–144, 1997.
15. A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '84), 1984, pp. 47–53.
16. N. Duhan and A. K. Sharma, "DBCCOM: Density Based Clustering with Constraints and Obstacle Modeling," in Contemporary Computing, vol. 168, Springer Berlin Heidelberg, 2011, pp. 212–228.
17. O. R. Zaïane and C.-H. Lee, "Clustering Spatial Data When Facing Physical Constraints," in Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 2002.