Instance-Level Constraints in Density-Based Clustering

Piotr Lasek
Chair of Computer Science, Rzeszow University
ul. Prof. Pigonia 1, 35-310 Rzeszow, Poland
lasek@ur.edu.pl

Abstract. Clustering data into meaningful groups is one of the most important tasks of both artificial intelligence and data mining. In general, clustering methods are considered unsupervised. In recent years, however, so-called constraints have become a popular means of incorporating additional knowledge into clustering algorithms, and a number of clustering algorithms employing different types of constraints have been proposed. In this paper we focus on instance-level constraints such as must-link and cannot-link, and present theoretical considerations on employing them in a well-known density-based clustering algorithm, DBSCAN. Additionally, we present a modified version of the algorithm that uses this type of constraint.

Key words: data mining, data clustering, supervised clustering, clustering with constraints, instance-level constraints

1 Introduction

Clustering is one of the best-known and most often used data mining methods. Its goal is to assign data objects (or points) to clusters so that objects assigned to the same cluster are more similar to each other than to objects assigned to other clusters [9]. Clustering algorithms can operate on different types of data sources such as databases, graphs, text, multimedia, or any other datasets containing objects that can be described by a set of features or relationships [2]. Performing a clustering task over a dataset can lead to the discovery of unknown yet interesting and useful patterns or trends. Since clustering algorithms do not require any external knowledge to run (apart from simple parameters such as k in the k-means algorithm), clustering, in contrast to classification, is often referred to as unsupervised learning. However, there has always been a natural need to incorporate already collected knowledge into algorithms to improve both their efficiency and the quality of their results. This need led to the emergence of a new branch of clustering algorithms based on so-called constraints. Constraint-based clustering algorithms exploit the fact that in many applications domain knowledge (e.g., in the form of labeled objects) is already available or can easily be specified by domain experts; in some cases such knowledge can even be detected automatically. Initially, researchers focused on algorithms that incorporated pairwise constraints on cluster membership or learned distance metrics. Subsequent research addressed algorithms that used many different kinds of domain knowledge [5].

The major contribution of this work is a method of using instance-level constraints in the DBSCAN algorithm.

The paper is divided into six sections. In Section 1 we have given a brief introduction to clustering with constraints. In Section 2 we briefly describe the best-known types of constraints and focus on instance-level constraints, namely must-link and cannot-link constraints. Section 3 presents reference works in the field of constrained clustering, especially those related to density-based clustering. In Section 4 the DBSCAN algorithm is recalled. In Section 5 we present the proposed method. Conclusions are drawn and further research is briefly discussed in Section 6.
2 Instance-Level Constraints

In constrained clustering algorithms, background knowledge is incorporated by means of different types of so-called constraints. Over the years, different methods of using constraints in clustering algorithms have been developed [5]; we review them in Section 3. Several types of constraints are known: for example, instance-level constraints describing relations between objects, and distance constraints such as inter-cluster δ-constraints and intra-cluster ε-constraints [2]. Nevertheless, hard instance-level constraints seem to be the most useful, since incorporating even a few constraints of this type can increase clustering accuracy as well as decrease runtime. In [12] the authors introduced two kinds of instance-level constraints, namely the must-link and cannot-link constraints. These constraints are simple but have interesting properties. For example, must-link constraints are symmetric, reflexive and transitive, like an equivalence relation: if two points, say p0 and p1, are in a must-link relationship (or, in other words, are connected by a must-link constraint), then these points must be assigned to the same cluster by a clustering algorithm. On the other hand, if two points, say r0 and r1, are in a cannot-link relationship (or are separated by a cannot-link constraint), then these points must not be assigned to the same cluster.
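The behaviour of these two constraint types can be illustrated with a short example. The Python sketch below (our illustration, not part of any of the cited algorithms) closes a set of must-link pairs transitively with a union-find structure and reports cannot-link pairs that become contradictory as a result:

class UnionFind:
    """Union-find over point ids, used to close must-link constraints transitively."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def check_constraints(n, must_link, cannot_link):
    """Return the must-link components and the cannot-link pairs they violate."""
    uf = UnionFind(n)
    for a, b in must_link:   # must-link behaves like an equivalence relation,
        uf.union(a, b)       # so we merge its connected components
    conflicts = [(a, b) for a, b in cannot_link
                 if uf.find(a) == uf.find(b)]  # contradictory constraints
    return uf, conflicts

# Example: ML(0,1) and ML(1,2) imply that 0 and 2 must share a cluster,
# so CL(0,2) is contradictory, while CL(2,3) remains satisfiable.
_, conflicts = check_constraints(4, [(0, 1), (1, 2)], [(0, 2), (2, 3)])
print(conflicts)  # [(0, 2)]

Any pair reported in conflicts cannot be satisfied by any clustering; Section 6 discusses how our method treats such contradictory points.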
3 Related Works

As mentioned before, in constrained clustering background knowledge is incorporated into algorithms by means of different types of constraints [5]. Constraint-based methods proposed so far employ techniques such as modifying the clustering objective function to include a penalty for violating specified constraints [6], clustering using conditional distributions in an auxiliary space, enforcing all constraints to be satisfied during the clustering process [13], or determining clusters and constraints based on neighborhoods derived from already available labeled examples [1]. In distance-based methods, on the other hand, the distance measure is designed so that it satisfies the given constraints [10, 4]. It is worth mentioning that among the constrained algorithms proposed so far there are only a few constraint-based representatives of the group that interests us here, density-based algorithms, namely C-DBSCAN [11], DBCCOM [7] and DBCluC [14].

C-DBSCAN [11] is an example of a density-based algorithm using instance-level constraints; it supports both must-link and cannot-link constraints. In the first step, the algorithm partitions the dataset using a KD-tree [3]. Next, under cannot-link constraints, so-called local clusters are created by means of density-reachable points. If there is a cannot-link constraint between points in the same leaf node of the KD-tree, then each point in the currently traversed leaf is labeled as a singleton local cluster. Otherwise, every point p from a leaf is checked as to whether it is a core point. If it is not, in other words, if the number of points in the ε-neighborhood of p is less than MinPts, then p is labeled as NOISE and ignored. If it is, then all points that are density-reachable from p are assigned to the same local cluster. Step 3a of the algorithm enforces must-link constraints: if points connected by must-link constraints were assigned to different local clusters, such clusters are merged into a so-called core local cluster. Step 3b further merges local clusters in order to enforce the cannot-link constraints that have not been satisfied yet. Merging is done by means of hierarchical agglomerative clustering with single linkage: for each pair of candidate clusters to be merged, it is checked whether they contain points involved in a cannot-link constraint, and if they do, the clusters cannot be merged. The algorithm stops when the number of clusters no longer changes.

DBCluC [14], which is also based on the well-known DBSCAN algorithm [8], employs an obstacle-modeling approach for density-based clustering of large two-dimensional datasets. Thanks to the modeling of obstacles, which improves the efficiency of clustering, it is capable of detecting clusters of arbitrary shape and is not sensitive to noise or to the order of points in a dataset or of the obstacle constraints. The efficiency of clustering is improved by reducing the polygons that model the obstacles: the algorithm removes unnecessary edges from the polygons, making clustering faster in terms of the number of constraints to be analyzed.

The DBCCOM algorithm [7] pre-processes the input dataset so as to take the presence of physical obstacles into account, modeling them similarly to DBCluC. It can detect clusters of arbitrary shape and size and is also considered insensitive to noise and to the order of points in a dataset. Obstacles are modeled as simple polygons, and a polygon edge-reduction algorithm decreases the number of edges used to test visibility between points in space; the results reported by the authors of the algorithm confirm that DBCCOM performs polygon reduction faster than DBCluC. The algorithm comprises three steps: first, it reduces the obstacles using the above-mentioned edge-reduction method; then it performs the clustering; finally, it applies hierarchical clustering to the formed clusters.
Table 1. Notations related to instance-level constraints used in this paper, and auxiliary variables used in the pseudo-code of the algorithm

Notation       Description
C=             The set of pairs of points that are in a must-link relation.
c=(p0, p1)     Points p0 and p1 are in a must-link relation (they must be assigned to the same resulting cluster).
C=(p)          The set of points which are in a must-link relation with point p.
C≠             The set of pairs of points that are in a cannot-link relation.
c≠(r0, r1)     Points r0 and r1 are in a cannot-link relation (they must not be assigned to the same resulting cluster).
C≠(r)          The set of points which are in a cannot-link relation with point r.
ClusterId      An auxiliary integer variable storing the identifier of the currently created cluster.
p.ClusterId    The ClusterId value associated with point p.
p.ndf          The value of the NDF factor associated with point p.
Rd, Rt         Auxiliary variables for storing deferred points.
DPSet          A variable for storing dense points, used in the iterative process of assigning points to clusters.

4 Density-Based Clustering

DBSCAN is a well-known density-based clustering algorithm. Below we recall the key definitions related to DBSCAN which will be used in the sequel. Notations and their descriptions are given in Table 1.

Definition 1 (ε-neighborhood εNN(p) of a point p). The ε-neighborhood of a point p is the set of all points q in dataset D that are distant from p by no more than ε; that is, εNN(p) = {q ∈ D | dist(p, q) ≤ ε}, where dist is a distance function.

Definition 2 (directly density-reachable points). A point p is directly density-reachable (Figure 1) from a point q with respect to ε and MinPts if the following two conditions are satisfied:
a) p ∈ εNN(q),
b) q is a core point, i.e., |εNN(q)| ≥ MinPts.

Fig. 1. p0 is directly density-reachable from core point p1; p0 is density-reachable from p2 (MinPts = 6).

Fig. 2. Both p0 and p5 are density-reachable from core point p2, so p0 and p5 belong to C(p2) (MinPts = 6).

Definition 3 (density-reachable points). A point p is density-reachable from a point q with respect to ε and MinPts if there is a sequence of points p1, ..., pn such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi for i = 1, ..., n−1 (Figure 2).

Definition 4 (border point). A point p is a border point if it is not a core point but is density-reachable from a core point.

Definition 5 (cluster). A cluster is a non-empty set of points in D which are density-reachable from the same core point.

Definition 6 (noise). Noise is the set of all points in D that are not density-reachable from any core point.

DBSCAN proceeds as follows. First, the algorithm generates a label for the first cluster to be found. Next, the points in D are read; the ClusterId attribute of each point is initially equal to UNCLASSIFIED. As the algorithm analyzes point after point, the ClusterId attributes of some points may change before those points are actually analyzed. Such a case occurs when a point is density-reachable from a core point examined earlier: such density-reachable points are assigned to the cluster of that core point and are not analyzed later. If the currently analyzed point p has not been classified yet (the value of its ClusterId attribute equals UNCLASSIFIED), the ExpandCluster function is called for this point. If p is a core point, then all points in C(p) are assigned by ExpandCluster to the cluster labeled with the currently created cluster's label, and a new cluster label is then generated by DBSCAN. Otherwise, if p is not a core point, the ClusterId attribute of p is set to NOISE, which means that p is tentatively treated as noise. After all points in D have been analyzed, each point's ClusterId attribute stores either a respective cluster label or the value NOISE; in other words, D contains only points which have been assigned to particular clusters or are noise.
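To make Definitions 1 and 2 concrete, the following Python sketch (our illustration; a naive O(n) scan, whereas in practice a spatial index such as a KD-tree would be used) implements the ε-neighborhood query and the core-point test:

import math

def eps_neighborhood(D, p, eps):
    """eNN(p): all points of D within distance eps of p (Definition 1)."""
    return [q for q in D if math.dist(p, q) <= eps]

def is_core(D, p, eps, min_pts):
    """p is a core point iff its eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts

D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core(D, (0.0, 0.0), eps=0.5, min_pts=3))  # True: three points nearby
print(is_core(D, (5.0, 5.0), eps=0.5, min_pts=3))  # False: isolated point

Note that, as in Definition 1, a point belongs to its own ε-neighborhood, since dist(p, p) = 0 ≤ ε.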
5 Clustering with Instance Constraints

In this section we present how we adapted instance-level constraints to DBSCAN. In our proposal we introduce another type of point, namely the deferred point. Below we present the definition of a deferred point as well as modified definitions of cluster and noise (Definitions 9 and 10, respectively).

Definition 7 (deferred point). A point p is deferred if it is involved in a cannot-link relationship with any other point, or if it belongs to an ε-neighborhood εNN(q), where q is any point involved in a cannot-link relationship.

Definition 8 (parent point). A parent point of a given point p is a point q which is involved in a cannot-link relationship with any other point and such that p is located within εNN(q).

Definition 9 (cluster). A cluster is a maximal non-empty subset of D such that:
– for two non-deferred points p and q in the cluster, p and q are density-reachable from the same core point with respect to ε and MinPts, and if p belongs to cluster C and q is density-connected with p, then q also belongs to C;
– a deferred point p is assigned to a cluster c if the nearest neighbor of p belongs to c; otherwise p is considered a noise point.

Definition 10 (noise). Noise is the set of all points in D that:
– are not density-reachable from any core point, or
– being deferred points, had two or more nearest neighbors located at the same distance and thus could not be unambiguously assigned to a cluster.

Our method works as follows. In the preparation phase, the modified DBSCAN algorithm (Figure 3) omits all points which are involved in any cannot-link relationship and marks them as DEFERRED. It adds those points to an auxiliary list Rd which is used later in the main loop of the algorithm and in the AssignDeferredPoints function. The algorithm then iterates through all UNCLASSIFIED points from D except those which were added to Rd, calling for each of them the ExpandCluster function (Figure 5) with all necessary parameters. The main modifications in this function concern how points involved in must-link relationships are processed: if such a point is found and it is a core point, or belongs to the neighborhood of a point which is a core point, then it is added to the seeds or curSeeds list, depending on which part of ExpandCluster is being executed.

The last part of the algorithm processes the DEFERRED points by means of the AssignDeferredPoints function (Figure 4). For each point q from Rd (the list of points marked as DEFERRED in the main method), this function determines the nearest cluster gp of q's parent point p. It also analyzes the cannot-link points connected with the currently processed area (the ε-neighborhood of p): it checks whether the parent point p, whose nearest cluster is gp, is in a cannot-link relationship with another parent point p≠ whose nearest cluster gp≠ is the same cluster. If so, the point q is marked as NOISE; otherwise, q is assigned to cluster gp.
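Before turning to the pseudo-code, the preparation phase described above can be sketched as follows in Python (our illustration; it reuses eps_neighborhood from the listing in Section 4, and the returned set plays the role of the list Rd from Table 1):

def mark_deferred(D, cannot_link, eps):
    """Preparation-phase sketch: every point involved in a cannot-link
    constraint, together with its eps-neighborhood, is deferred (Definition 7)."""
    constrained = {p for pair in cannot_link for p in pair}
    deferred = set()
    for q in constrained:
        deferred.add(q)                               # q itself is deferred
        deferred.update(eps_neighborhood(D, q, eps))  # and so is eNN(q)
    return deferred  # this set plays the role of Rd in the pseudo-code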
Algorithm DBSCAN(D, ε, MinPts, C=, C≠)
1.  Rd = ∅;
2.  label all points in D as UNCLASSIFIED;
3.  ClusterId = label of a first cluster;
4.  for each point q involved in any constraint from C≠ do
5.      label q and points in εNN(q) as DEFERRED;
6.  endfor;
7.  add all DEFERRED points to Rd;
8.  for each point p in set D \ Rd do
9.      if (p.ClusterId = UNCLASSIFIED) then
10.         if ExpandCluster(D, p, ClusterId, ε, MinPts, C=, C≠) then
11.             ClusterId = NextId(ClusterId);
12.         endif;
13.     endif;
14. endfor;
15. AssignDeferredPoints(D, Rd, C≠);

Fig. 3. The DBSCAN algorithm.

Function AssignDeferredPoints(D, Rd, C≠)
1.  for each point q ∈ Rd do
2.      p ← GetParent(q);
3.      gp ← NearestCluster(p);
4.      p≠ ← C≠(p);
5.      gp≠ ← NearestCluster(p≠);
6.      if gp = gp≠ then
7.          mark q as NOISE;
8.      else
9.          assign point q to gp;
10.     endif;
11.     remove q from Rd;
12. endfor;

Fig. 4. The AssignDeferredPoints function: assigning deferred points to clusters.

Function ExpandCluster(D, p, ClId, ε, MinPts, C=, C≠)
1.  seeds = Neighborhood(D, p, ε);
2.  if |seeds| < MinPts then
3.      p.ClusterId = NOISE;
4.      return FALSE;
5.  else do
6.      for each point q in seeds do
7.          q.ClusterId = ClId;
8.          add C=(q) to seeds;
9.      endfor;
10.     delete p from seeds;
11.     while |seeds| > 0 do
12.         curPoint = first point in seeds;
13.         curSeeds = Neighborhood(D, curPoint, ε);
14.         if |curSeeds| ≥ MinPts then
15.             for each point q in curSeeds do
16.                 add C=(q) to seeds;
17.                 if q.ClusterId = UNCLASSIFIED then
18.                     q.ClusterId = ClId;
19.                     append q to seeds;
20.                 else if q.ClusterId = NOISE then
21.                     q.ClusterId = ClId;
22.                 endif;
23.             endfor;
24.         endif;
25.         delete curPoint from seeds;
26.     endwhile;
27.     return TRUE;
28. endelse;

Fig. 5. The ExpandCluster function.

Fig. 6. The pseudo-code of the DBSCAN algorithm using instance constraints (Figures 3-5).
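For illustration, the following Python sketch (ours) mirrors the logic of the AssignDeferredPoints function. GetParent and NearestCluster from the pseudo-code are modeled as injected callables, since their concrete implementations are not fixed by the pseudo-code; the generalization to a set of cannot-link partners follows C≠(p) being a set in Table 1:

NOISE = -1

def assign_deferred(Rd, get_parent, nearest_cluster, cannot_link_of, assignment):
    """Sketch of AssignDeferredPoints: each deferred point follows its parent's
    nearest cluster, unless a cannot-link partner of the parent gravitates to
    the same cluster, in which case the point is marked as noise."""
    for q in list(Rd):
        p = get_parent(q)          # parent point of q (Definition 8)
        g_p = nearest_cluster(p)   # cluster the parent would join
        # clusters nearest to the parent's cannot-link partners
        g_conflicts = {nearest_cluster(r) for r in cannot_link_of(p)}
        assignment[q] = NOISE if g_p in g_conflicts else g_p
        Rd.remove(q)

# Toy usage with precomputed lookups standing in for the helper functions:
parent = {"q1": "p", "q2": "r"}
nearest = {"p": 0, "r": 1, "s": 0}
cl = {"p": ["s"], "r": ["p"]}
assignment = {}
assign_deferred({"q1", "q2"}, parent.get, nearest.get,
                lambda x: cl.get(x, []), assignment)
print(assignment)  # q1 becomes NOISE (-1), q2 is assigned to cluster 1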
6 Conclusions and Further Research

Must-link and cannot-link constraints are supposed to work so that points connected by must-link constraints are assigned to the same clusters and points in cannot-link relationships are not. Obviously, such constraints may lead to many problems, not to mention that some constraints may be contradictory. Our approach slightly loosens both must-link and cannot-link constraints by prioritizing them. The intuition is that must-link constraints are more important than cannot-link constraints, which gives us an advantage when trying to fulfill all constraints: we try to fulfill all must-link constraints first (assuming, of course, that all of them are valid), treating them as more important than cannot-link constraints. When processing cannot-link constraints, points which are contradictory (in terms of fulfilling both must-link and cannot-link constraints) are marked as noise.

In this paper we presented a method of adapting the DBSCAN algorithm to work with instance-level constraints. Our future work will focus on analyzing the complexity of the proposed method as well as on testing its performance against other constrained clustering algorithms.

References

1. Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: Proceedings of the 19th International Conference on Machine Learning (ICML 2002), 2002.
2. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press, 2008.
3. Bentley, J. L.: Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 1975, 509-517.
4. Chang, H., Yeung, D.-Y.: Locally linear metric adaptation for semi-supervised clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), ACM, 2004.
5. Davidson, I., Basu, S.: A survey of clustering with instance level constraints. ACM Transactions on Knowledge Discovery from Data, 1, 2007, 1-41.
6. Davidson, I., Ravi, S.: Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the 2005 SIAM International Conference on Data Mining (SDM), SIAM, 2005.
7. Duhan, N., Sharma, A.: DBCCOM: Density Based Clustering with Constraints and Obstacle Modeling. In: Contemporary Computing, Springer, 2011, 212-228.
8. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
9. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
10. Hertz, T., Bar-Hillel, A., Weinshall, D.: Boosting margin based distance functions for clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), ACM, 2004.
11. Ruiz, C., Spiliopoulou, M., Menasalvas, E.: C-DBSCAN: Density-based clustering with constraints. In: Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Springer, 2007, 216-223.
12. Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: Proceedings of AAAI/IAAI, 2000.
13. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 2001.
14. Zaïane, O. R., Lee, C.-H.: Clustering spatial data when facing physical constraints. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), IEEE, 2002.