=Paper=
{{Paper
|id=None
|storemode=property
|title=Subjectively Interesting Alternative Clusters
|pdfUrl=https://ceur-ws.org/Vol-772/multiclust2011-paper6.pdf
|volume=Vol-772
|dblpUrl=https://dblp.org/rec/conf/multiclust/Bie11
}}
==Subjectively Interesting Alternative Clusters==
Tijl De Bie
Intelligent Systems Laboratory, University of Bristol, UK
tijl.debie@gmail.com
http://www.tijldebie.net
Abstract. We deploy a recently proposed framework for mining subjectively interesting patterns from data [3] to the problem of clustering, where patterns are clusters in the data. This framework outlines how the subjective interestingness of patterns (here, clusters) can be quantified using sound information-theoretic concepts. We demonstrate how it motivates a new objective function quantifying the interestingness of (a set of) clusters, automatically accounting for a user's prior beliefs and for redundancies between the clusters.
Directly searching for the optimal set of clusters defined in this way is hard. However, the optimization problem can be solved to a provably good approximation if clusters are generated iteratively, paralleling the iterative data mining setting discussed in [3]. In this iterative scheme, each subsequent cluster is maximally interesting given the previously generated ones, automatically trading off interestingness with non-redundancy. Thus, this implementation of the clustering approach can be regarded as a method for alternative clustering. Although generating each cluster in an iterative fashion is computationally hard as well, we develop an approximation technique similar to spectral clustering algorithms.
We end with a few visual demonstrations of the alternative clustering approach on artificial datasets.
Keywords: Subjective interestingness, alternative clustering
1 Introduction
A main challenge in research on clustering methods and theory is that clustering is (in a way intentionally) ill-defined as a task. As a result, numerous types of syntaxes for cluster patterns have been suggested (e.g. clusters as hyperrectangles, hyperspheres, ellipsoids, or decision trees; clusterings as partitions, hierarchical partitionings, etc.). Additionally, even when the syntax is fixed, there are often various alternative choices for the objective function (e.g. the K-means cost function, the likelihood of a mixture of Gaussians, etc.).
Despite this variety in approaches, the goal of clustering is almost always to provide a user with insight into the structure of the data, allowing the user to conceptualize it as coming from a number of broad areas in the data space. The knowledge of such a structure can be more or less elucidating to the user, depending also on the prior beliefs the user holds about the data.
Here, we take the perspective that a clustering is more useful if it conveys
more novel information. We make a specific choice for a cluster syntax, and we
deploy ideas from [3] to quantify the interestingness of a cluster as the amount
of information conveyed to the user when told about the cluster’s presence.
Our approach attempts to quantify subjective interestingness of clusters, in
that it takes prior beliefs held by the user into account. As a result, different
clusters might be deemed interesting to different users. One particular example is
the situation where a user has already been informed about previously discovered
clusters in the data, which is the alternative clustering setting. In that case, clusters that are individually informative and mutually non-redundant will be the most interesting ones. Our approach naturally deals with alternative clustering, by regarding already communicated clusters as prior beliefs.
Throughout this paper, $\mathbf{x} \in \mathbb{R}^d$ denotes a $d$-dimensional data point, and $\mathbf{X} = (\mathbf{x}_1\;\mathbf{x}_2\;\cdots\;\mathbf{x}_n)'$ denotes the data matrix containing the $n$ data points as its rows. The space the data matrix belongs to is denoted as $\mathcal{X} = \mathbb{R}^{n \times d}$.
2 A framework for data mining: a brief overview
For completeness, we here provide a short overview of a framework for data
mining that was introduced in [3], and readers familiar with this paper can skip
this section. Earlier and more limited versions of this framework, as well as its
application to frequent pattern mining, can be found in [4, 2, 5]. For concreteness,
here we specialize the short overview of the framework to the case where the data
is a data set, summarized in the data matrix X.
The framework aims to formalize data mining as a process of information exchange between the data and the data miner (the user). The goal of the data miner is to get as good an understanding of the data as possible, i.e. to reduce his uncertainty as much as possible. To be able to do this, the degree of uncertainty must be quantified; to this end we use a probability distribution $P$ (referred to as the background distribution) to model the prior beliefs of the user about the data $\mathbf{X}$, in combination with ideas from information theory.
More specifically, the framework deals with the setting where the prior beliefs specify that the background distribution belongs to a set $\mathcal{P}$ of possible distributions. The more prior beliefs, the smaller this set will be. For example, the data miner may have a set of prior beliefs that can be formalized in the form of constraints the background distribution $P$ satisfies:

$$\sum_{\mathbf{X} \in \mathbb{R}^{n \times d}} f_i(\mathbf{X})\,P(\mathbf{X}) = c_i, \quad \forall i.$$

Such constraints represent the fact that the expected values of certain statistics (the functions $f_i$) are equal to given numbers $c_i$. The set $\mathcal{P}$ is then defined as the set of distributions satisfying these constraints. (Note that the framework is not limited to such prior beliefs, although they are convenient from a practical viewpoint.)
We argued in [3] that among all distributions $P \in \mathcal{P}$, the 'best' choice for $P$ is the one of maximum entropy subject to these constraints. This background distribution is the least biased one, thus not introducing any undue additional assumptions. A further game-theoretic argument in favour of using the distribution of maximum entropy is given in [3].
In the framework, a pattern is defined as any piece of knowledge about the data that reduces the set of possible values it may take from the original data space $\mathcal{X} = \mathbb{R}^{n \times d}$ to a subset $\mathcal{X}' \subseteq \mathcal{X}$. We then argued that the subjective interestingness of such a pattern can be adequately formalized as the self-information of the pattern, i.e. the negative logarithm of the probability that the pattern is present in the data: $-\log\left(P(\mathbf{X} \in \mathcal{X}')\right)$. Thus, patterns are deemed more interesting if their probability under the background model is smaller, and thus if the user is more surprised by their observation.
After observing a pattern, a user instinctively adapts his beliefs. In [3] we argued that a natural and robust way to model this is by updating the background distribution to a new distribution $P'$, defined as $P$ conditioned on the pattern's presence. The self-information of subsequent patterns can then be evaluated with respect to the new background distribution $P'$, and so on in an iterative fashion.
In [3] we showed that mining the most informative set of patterns formally corresponds to a weighted set-coverage problem, attempting to cover those elements of the data space $\mathcal{X}$ that have a small probability under the initial background distribution $P$. This problem is NP-hard, but it can be approximated well by a greedy approach, and the iterative data mining approach is equivalent to such a greedy approximation.
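To make the greedy approximation concrete, here is a minimal sketch of a generic greedy weighted set-coverage heuristic of the kind alluded to above. It is purely illustrative: the function name, the discrete universe of elements, and the cost model are assumptions of this sketch, not part of the framework in [3].

```python
import numpy as np

# Illustrative greedy heuristic for weighted set coverage (assumption of this
# sketch: a discrete universe of elements with known probabilities). Each
# candidate pattern covers a set of elements; at every step we pick the
# pattern with the largest newly covered surprisal mass per unit cost.
def greedy_cover(candidates, costs, element_probs, n_patterns):
    covered, chosen = set(), []
    for _ in range(n_patterns):
        def gain(i):
            new = candidates[i] - covered
            return sum(-np.log(element_probs[e]) for e in new) / costs[i]
        best = max(range(len(candidates)), key=gain)
        if gain(best) <= 0.0:
            break  # no candidate adds information anymore
        chosen.append(best)
        covered |= candidates[best]
    return chosen
```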
Thus, the alternative clustering method we will detail below generates clusters in an iterative manner, in such a way that at any time the clusters generated so far are approximately the most informative set of clusters of that size.
3 Subjective interestingness of clusters
3.1 Prior beliefs and the maximum entropy background distribution
Here we consider two types of initial prior beliefs, expressed as constraints on the first and second order cumulants of the data points. It is conceptually easy to extend the results from this paper to other types of prior beliefs, although the computational cost will vary. The background distribution incorporating these constraints is the maximum entropy distribution that has the prescribed first and second order cumulants. It is well known that for data with unbounded domain, this distribution is the multivariate Gaussian distribution with mean and covariance matrix equal to the prescribed cumulants:

$$P(\mathbf{X}) = \frac{1}{\sqrt{(2\pi)^{nd}\,|\boldsymbol{\Sigma}|^{n}}} \exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)\right), \tag{1}$$

where $\mathbf{e}$ denotes the all-ones vector, $\boldsymbol{\mu}$ the prescribed mean, and $\boldsymbol{\Sigma}$ the prescribed covariance matrix.

We note that the prescribed cumulants may be computed from the actual data at the request of the data miner, such that they are indeed part of the prior knowledge. However, they may also be mere beliefs, in the sense that they may be based on external information or assumptions that may be right or wrong.
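For concreteness, the log-density of Eq. (1) can be evaluated directly; the following is a minimal numpy sketch, where the function name and argument conventions are ours (rows of `X` are data points, `mu` and `Sigma` are the prescribed cumulants).

```python
import numpy as np

# Minimal sketch of the log-density of the maximum entropy background
# distribution in Eq. (1): i.i.d. Gaussian rows with mean mu, covariance Sigma.
def background_logpdf(X, mu, Sigma):
    n, d = X.shape
    R = X - np.outer(np.ones(n), mu)                  # X - e mu'
    quad = np.trace(R @ np.linalg.inv(Sigma) @ R.T)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + quad)
```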
3.2 A syntax for cluster patterns
The framework from [3] was developed for patterns generally defined as properties of the data. Thus, a pattern's presence in the data constrains the set of possible values the data may have, and in this sense the knowledge of the presence of a pattern reduces the uncertainty about the data and conveys information.
In this paper we restrict our attention to one specific type of cluster pattern. The pattern type we consider is parameterized by a set of indices to the data points and a vector in the data space. The pattern is then the fact that the mean of the data points for these indices is equal to the specified vector.
More formally, let $\mathbf{e}_I \in \mathbb{R}^n$ be defined as the indicator vector containing ones at positions $i \in I$ and zeros at positions $i \notin I$, and let $n_I = |I| = \mathbf{e}_I'\mathbf{e}_I$ denote the number of elements in $I$. Then, our patterns are constraints of the form:

$$\frac{1}{n_I}\sum_{i \in I} \mathbf{x}_i = \boldsymbol{\mu}_I \quad\Leftrightarrow\quad \mathbf{X}'\,\frac{\mathbf{e}_I}{\mathbf{e}_I'\mathbf{e}_I} = \boldsymbol{\mu}_I.$$

Such a constraint restricts the possible values of the data set $\mathbf{X}$, in that the mean of a subset of the data points is required to have a prescribed value $\boldsymbol{\mu}_I$.
3.3 The self-information of a cluster pattern
The following theorem shows how to assess the self-information of a pattern.

Theorem 1. Given a background distribution of the form in Eq. (1), the probability of a pattern of the form $\mathbf{X}'\frac{\mathbf{e}_I}{\mathbf{e}_I'\mathbf{e}_I} = \boldsymbol{\mu}_I$ is given by:

$$P\left(\mathbf{X}'\,\frac{\mathbf{e}_I}{\mathbf{e}_I'\mathbf{e}_I} = \boldsymbol{\mu}_I\right) = \frac{1}{\sqrt{(2\pi)^{d}\,|\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2|I|}\,\mathbf{e}_I'\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\,\mathbf{e}_I\right).$$

Thus the self-information of the pattern for a cluster specified by the set $I$, defined as the negative log probability of the cluster pattern and denoted as $\mathrm{SelfInformation}_I$, is equal to:

$$\mathrm{SelfInformation}_I = \frac{1}{2}\log\left((2\pi)^d\,|\boldsymbol{\Sigma}|\right) + \frac{1}{2} Q_I,$$

where $Q_I = \frac{1}{|I|}\,\mathbf{e}_I'\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\,\mathbf{e}_I$.
Note that the self-information depends on I only through QI , so we may choose
to use QI as a quality metric for a cluster, as we will do in this paper.
This theorem can be used to quantify the self-information of any cluster
given the background model based on the prior beliefs of the data miner. Note
however that it cannot be used to assess the self-information of a cluster after
other clusters have already been found, as each new cluster will affect the user’s
prior beliefs. How this can be accounted for will be discussed in Sec. 3.4, based on
a generalization of Theorem 1. As Theorem 1 directly follows from Theorem 2,
we will only provide a proof for the latter in Sec. 3.4.
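In code, Theorem 1 translates into a few lines; a minimal numpy sketch follows, with our own naming conventions (`I` is a list of row indices, `mu` and `Sigma` the prior cumulants).

```python
import numpy as np

# Minimal sketch of Theorem 1: Q_I and the self-information of a single
# cluster pattern under the Gaussian background model of Eq. (1).
def cluster_quality(X, mu, Sigma, I):
    n = X.shape[0]
    e_I = np.zeros(n)
    e_I[list(I)] = 1.0                                # indicator vector of I
    R = X - np.outer(np.ones(n), mu)                  # X - e mu'
    B = R @ np.linalg.inv(Sigma) @ R.T                # (X - e mu') Sigma^{-1} (X - e mu')'
    return (e_I @ B @ e_I) / len(I)                   # Q_I

def self_information(X, mu, Sigma, I):
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2.0 * np.pi) + logdet) \
        + 0.5 * cluster_quality(X, mu, Sigma, I)
```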
3.4 The self-information of a set of cluster patterns
Let us discuss the more general case, where the patterns are constraints of the form $\mathbf{X}'\mathbf{E} = \mathbf{M}$ with $\mathbf{E} \in \mathbb{R}^{n \times k}$ and $\mathbf{M} \in \mathbb{R}^{d \times k}$. Clearly, the type of patterns we wish to consider is a special case of this, with $\mathbf{E} = \frac{\mathbf{e}_I}{\mathbf{e}_I'\mathbf{e}_I}$ and $\mathbf{M} = \boldsymbol{\mu}_I$. Furthermore, it allows us to consider a composite pattern, a pattern defined as the union of a set of $k$ patterns. Indeed, if we have $k$ different cluster patterns specified by the sets from $\mathcal{I} = \{I_i\}$, we can write down this set of constraints concisely as $\mathbf{X}'\mathbf{E} = \mathbf{M}$, where $\mathbf{E}$ and $\mathbf{M}$ contain $\frac{\mathbf{e}_{I_i}}{\mathbf{e}_{I_i}'\mathbf{e}_{I_i}}$ and the mean vector $\boldsymbol{\mu}_i$ of the $i$'th cluster as their $i$'th columns.
Theorem 2. Let the columns of the matrix $\mathbf{E}$ be the indicator vectors of the sets in $\mathcal{I} = \{I_i\}$, and let $\mathbf{P}_E = \mathbf{E}(\mathbf{E}'\mathbf{E})^{-1}\mathbf{E}'$ be the projection matrix onto the column space of $\mathbf{E}$. Then, the probability of the composite pattern $\mathbf{X}'\mathbf{E} = \mathbf{M}$ is given by:

$$P(\mathbf{X}'\mathbf{E} = \mathbf{M}) = \frac{1}{\sqrt{(2\pi)^{kd}\,|\boldsymbol{\Sigma}|^{k}}} \exp\left(-\frac{1}{2}\,\mathrm{trace}\left(\mathbf{P}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)\right).$$

Thus the self-information of the set of patterns defined by the columns of $\mathbf{E}$, defined as its negative log probability and denoted as $\mathrm{SelfInformation}_{\mathcal{I}}$, is equal to:

$$\mathrm{SelfInformation}_{\mathcal{I}} = \frac{k}{2}\log\left((2\pi)^d\,|\boldsymbol{\Sigma}|\right) + \frac{1}{2} Q_{\mathcal{I}},$$

where $Q_{\mathcal{I}} = \mathrm{trace}\left(\mathbf{P}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)$.

Again, since the self-information depends on $\mathcal{I}$ only through $Q_{\mathcal{I}}$, we choose to use $Q_{\mathcal{I}}$ as a quality metric further below.
Proof. A constraint $\mathbf{X}'\mathbf{E} = \mathbf{M}$ constrains the data $\mathbf{X}$ to an $(n-k) \times d$ dimensional affine subspace, in the following way. Let us write the singular value decomposition of $\mathbf{E}$ as:

$$\mathbf{E} = \begin{pmatrix}\mathbf{U} & \mathbf{U}_0\end{pmatrix} \begin{pmatrix}\boldsymbol{\Lambda} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{pmatrix} \begin{pmatrix}\mathbf{V} & \mathbf{V}_0\end{pmatrix}'.$$

Then, this constraint can be written in the following form:

$$\mathbf{X} = \mathbf{U}\mathbf{Z} + \mathbf{U}_0\mathbf{Z}_0,$$

where $\mathbf{Z} = \boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{M}'$ is a constant fixed by $\mathbf{E}$ and $\mathbf{M}$, and $\mathbf{Z}_0 \in \mathbb{R}^{(n-k) \times d}$ is a variable. In general, writing $\mathbf{X} = \mathbf{U}\mathbf{Z} + \mathbf{U}_0\mathbf{Z}_0$, we can write the probability density for $\mathbf{X}$ as:

$$\begin{aligned} P(\mathbf{X}) &= P(\mathbf{Z}, \mathbf{Z}_0)\\ &= \frac{1}{\sqrt{(2\pi)^{nd}\,|\boldsymbol{\Sigma}|^{n}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\mathbf{U}\mathbf{Z}+\mathbf{U}_0\mathbf{Z}_0-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{U}\mathbf{Z}+\mathbf{U}_0\mathbf{Z}_0-\mathbf{e}\boldsymbol{\mu}')'\right)\right)\\ &= \frac{1}{\sqrt{(2\pi)^{(n-k)d}\,|\boldsymbol{\Sigma}|^{n-k}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\mathbf{Z}_0-\mathbf{U}_0'\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{Z}_0-\mathbf{U}_0'\mathbf{e}\boldsymbol{\mu}')'\right)\right)\\ &\quad\cdot \frac{1}{\sqrt{(2\pi)^{kd}\,|\boldsymbol{\Sigma}|^{k}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\mathbf{Z}-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{Z}-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')'\right)\right). \end{aligned}$$

We can now compute the marginal probability density for $\mathbf{Z}$ by integrating over $\mathbf{Z}_0$, yielding:

$$P(\mathbf{Z}) = \frac{1}{\sqrt{(2\pi)^{kd}\,|\boldsymbol{\Sigma}|^{k}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\mathbf{Z}-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{Z}-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')'\right)\right).$$

The probability density value for the pattern's presence, i.e. for $\mathbf{X}'\mathbf{E} = \mathbf{M}$ or equivalently $\mathbf{Z} = \boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{M}'$, is thus:

$$\begin{aligned} P(\mathbf{Z} = \boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{M}') &= \frac{1}{\sqrt{(2\pi)^{kd}\,|\boldsymbol{\Sigma}|^{k}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left((\boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{M}'-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{M}'-\mathbf{U}'\mathbf{e}\boldsymbol{\mu}')'\right)\right)\\ &= \frac{1}{\sqrt{(2\pi)^{kd}\,|\boldsymbol{\Sigma}|^{k}}}\exp\left(-\frac{1}{2}\,\mathrm{trace}\left(\mathbf{P}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)\right), \end{aligned}$$

where $\mathbf{P}_E = \mathbf{E}(\mathbf{E}'\mathbf{E})^{-1}\mathbf{E}'$ is a projection matrix projecting onto the $k$-dimensional column space of $\mathbf{E}$.

Note that Theorem 1 is indeed a special case of Theorem 2, as can be seen by substituting $\mathbf{E} = \mathbf{e}_I$ and $\mathbf{P}_E = \frac{\mathbf{e}_I\mathbf{e}_I'}{|I|}$.
According to the framework, a (composite) pattern specified by matrices $\mathbf{E}$ and $\mathbf{M}$ in this way is thus more informative if $\mathrm{trace}\left(\mathbf{P}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)$ is larger. Thus, we could search for the most informative set of clusters by maximizing this quality measure with respect to a set $\mathcal{I} = \{I_i\}$ of clusters. This is a hard problem though, even in the case where only one cluster is sought. Thus, we developed an approximation algorithm.
Our approach is approximate in two ways. First, the search for a set of clusters is approximated using a greedy algorithm, searching clusters one by one, thus operating like an alternative clustering algorithm. From [3], it can be seen that this greedy approximation is guaranteed to approximate the true optimum provably well. Second, the search for each cluster is relaxed to an eigenvalue problem. These issues are discussed in greater detail in Sec. 4.
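A sketch of the composite quality measure of Theorem 2, under the same conventions as the earlier sketches (`E` holds one 0/1 indicator vector per cluster in its columns; the pseudo-inverse is our guard against a rank-deficient `E`):

```python
import numpy as np

# Minimal sketch of Q_I from Theorem 2 for a set of clusters encoded in the
# columns of E (one 0/1 indicator vector per cluster).
def composite_quality(X, mu, Sigma, E):
    n = X.shape[0]
    R = X - np.outer(np.ones(n), mu)                  # X - e mu'
    B = R @ np.linalg.inv(Sigma) @ R.T
    P_E = E @ np.linalg.pinv(E.T @ E) @ E.T           # projection onto col(E)
    return np.trace(P_E @ B)
```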
3.5 The cost of describing a cluster
The framework from [3] suggests taking into account not only the self-information of a pattern, but also the cost of communicating a pattern, i.e. its description length. This depends on the coding scheme used, which should reflect the complexity of a pattern as perceived by the data miner. Choosing this coding scheme can also be done so as to bias the results toward specific patterns.
In the current context, describing a pattern amounts to describing the subset $I$ and the mean vector $\boldsymbol{\mu}_I$. For simplicity, we assume the description length is constant for all patterns, independent of $I$ and $\boldsymbol{\mu}_I$. However, note that different costs could be used if patterns with smaller sets $I$ are easier to understand (i.e. have a smaller cost), or vice versa.
4 Alternative clustering: finding the next most
informative cluster
Here we discuss an iterative approach to optimizing the quality measure $Q_{\mathcal{I}}$ from Theorem 2. There are two reasons for choosing an iterative approach.
Firstly, directly optimizing the quality measure is equivalent to a set covering
type problem (see [3] for more background on why this is the case). While NP-
hard, this problem can be approximated well by optimizing over the different
clusters (and thus the columns of E) in a greedy iterative manner.
Secondly, usually it is not a priori clear how many clusters are required for the
data miner to be sufficiently satisfied with his new understanding of the data. The
idea of alternative clustering, as we view it, is to provide the user the opportunity
to request new clusters (or clusterings) as long as more are desired. Optimizing
the quality measure over a growing set of clusters by iteratively optimizing over
a newly added column of E is thus a type of alternative clustering.
Hence, the iterative approach can be regarded as an approximation, but one with usability benefits over a globally optimizing approach.
4.1 The iterative scheme: alternative clustering
To search for the first cluster, we attempt to optimize the quality function from
Theorem 1. This is itself a hard problem, but we explain how we approximately
solve it in Sec. 4.2.
In the subsequent iterations, let us say that we have already found $k-1 \ge 1$ clusters, and the matrices $\mathbf{E}$ and $\mathbf{M}$ respectively contain the normalized indicator vectors and cluster means as their columns. We are interested in finding the $k$'th cluster so as to optimize the quality measure from Theorem 2, while keeping the first $k-1$ cluster patterns as they are.
To do this, it is convenient to write the quality measure as a function of the $k$'th cluster with indicator vector $\mathbf{e}_k$. Let $\mathbf{Q}_E = \mathbf{I} - \mathbf{P}_E$ be the projection matrix onto the orthogonal complement of the column space of $\mathbf{E}$. Furthermore, let us denote $\mathbf{E}_* = \begin{pmatrix}\mathbf{E} & \mathbf{e}_k\end{pmatrix}$. Then, using the definition of a projection matrix and the matrix inversion lemma:

$$\begin{aligned} Q_{\{I_i|i=1:k\}} &= \mathrm{trace}\left(\mathbf{P}_{E_*}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)\\ &= \mathrm{trace}\left(\mathbf{P}_{E}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right) + \mathrm{trace}\left(\frac{\mathbf{Q}_E\mathbf{e}_k\mathbf{e}_k'\mathbf{Q}_E}{\mathbf{e}_k'\mathbf{Q}_E\mathbf{e}_k}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\right)\\ &= Q_{\{I_i|i=1:k-1\}} + \frac{\mathbf{e}_k'\mathbf{Q}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\,\mathbf{Q}_E\mathbf{e}_k}{\mathbf{e}_k'\mathbf{Q}_E\mathbf{e}_k}. \end{aligned}$$

Note that if we define $Q_\emptyset = 0$ and take $\mathbf{Q}_E = \mathbf{I}$, the quality measure from Theorem 1 is retrieved. We can thus interpret the above reformulation of the quality metric for the $k$'th cluster conditioned on the first $k-1$ clusters as the quality metric for a first cluster on data that is projected onto the space orthogonal to the $k-1$ columns of $\mathbf{E}$, i.e. the $k-1$ previously selected indicator vectors. It is as if the data were deflated to take account of the knowledge of the previously found cluster patterns, thus automatically accounting for redundancy.
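The rank-one update above is easy to verify numerically; the following sketch checks that appending a column $\mathbf{e}_k$ to $\mathbf{E}$ increases the trace by exactly the Rayleigh-quotient term (all matrices here are illustrative stand-ins chosen by us).

```python
import numpy as np

# Numerical check (illustrative) of the decomposition of Q_{I_1..I_k} above.
rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
B = A @ A.T                          # stands in for (X - e mu') Sigma^{-1} (X - e mu')'

E = np.zeros((n, 2))                 # two previously found indicator vectors
E[[0, 1, 2, 3], 0] = 1.0
E[[4, 5, 6], 1] = 1.0
e_k = np.zeros(n)                    # candidate k'th cluster
e_k[[2, 5, 7, 8]] = 1.0

def proj(M):
    return M @ np.linalg.pinv(M.T @ M) @ M.T

Q_E = np.eye(n) - proj(E)
lhs = np.trace(proj(np.column_stack([E, e_k])) @ B)
rhs = np.trace(proj(E) @ B) + (e_k @ Q_E @ B @ Q_E @ e_k) / (e_k @ Q_E @ e_k)
assert np.isclose(lhs, rhs)
```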
4.2 A spectral relaxation of the iterations
Each of the iterative steps thus reduces to the maximization of the following increase of the quality measure:

$$\Delta Q_k = \frac{\mathbf{e}_k'\mathbf{Q}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\,\mathbf{Q}_E\mathbf{e}_k}{\mathbf{e}_k'\mathbf{Q}_E\mathbf{e}_k}. \tag{2}$$

If we relax the vector $\mathbf{e}_k$ to be real-valued instead of containing only 0's and 1's, this Rayleigh quotient is maximized by the dominant eigenvector of the matrix $\mathbf{Q}_E\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\mathbf{e}\boldsymbol{\mu}')'\,\mathbf{Q}_E$. Thus, as an approximation technique we will use this dominant eigenvector and threshold it to obtain a crisp 0/1 vector $\mathbf{e}_k$. To determine a suitable threshold, we simply do an exhaustive search over the $n+1$ threshold values that each generate a different set $I$ of indices $i$ for which $\mathbf{e}_k(i) = 1$, selecting the threshold that maximizes the quantity in Eq. (2).
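Putting Secs. 4.1 and 4.2 together, the whole iterative scheme fits in a short script. The following numpy sketch assumes the prior beliefs $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$ used in the experiments of Sec. 6; it is an illustrative reimplementation, not the code used for the figures.

```python
import numpy as np

# Minimal sketch of the iterative spectral scheme of Sec. 4, for mu = 0 and
# Sigma = I, so that (X - e mu') Sigma^{-1} (X - e mu')' reduces to X X'.
def alternative_clusters(X, n_clusters):
    n = X.shape[0]
    B = X @ X.T                        # inner product matrix; see also Sec. 4.3
    E = np.zeros((n, 0))               # indicator vectors found so far
    clusters = []
    for _ in range(n_clusters):
        Q_E = np.eye(n) if E.shape[1] == 0 else \
              np.eye(n) - E @ np.linalg.pinv(E.T @ E) @ E.T
        M = Q_E @ B @ Q_E              # deflated matrix from Eq. (2)
        _, V = np.linalg.eigh(M)       # eigenvalues ascending for symmetric M
        v = V[:, -1]                   # dominant eigenvector (relaxed e_k)
        if v[np.argmax(np.abs(v))] < 0:
            v = -v                     # resolve the eigenvector's sign ambiguity
        order = np.argsort(-v)         # thresholding = keeping the top-m entries
        best_gain, best_e = -np.inf, None
        for m in range(1, n + 1):      # exhaustive search over thresholds
            e_k = np.zeros(n)
            e_k[order[:m]] = 1.0
            denom = e_k @ Q_E @ e_k
            if denom > 1e-12:
                gain = (e_k @ Q_E @ B @ Q_E @ e_k) / denom   # Delta Q_k, Eq. (2)
                if gain > best_gain:
                    best_gain, best_e = gain, e_k
        clusters.append(np.flatnonzero(best_e))
        E = np.column_stack([E, best_e])
    return clusters
```

On data resembling the first experiment of Sec. 6, one would call `alternative_clusters(X, 6)` to obtain six consecutive alternative clusters in the flavour of Fig. 1.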
4.3 A kernel-based version
Note that for $\boldsymbol{\Sigma} = \mathbf{I}$, the quality metrics depend on $\mathbf{X}$ only through the inner product matrix $\mathbf{X}\mathbf{X}'$. This means that a kernel variant is readily derived, by substituting this inner product matrix with any suitable kernel matrix. In this way nonlinearly shaped clusters can be obtained, similar to spectral clustering methods and kernel K-means.
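As a sketch of this kernel variant (the width convention below, $\exp(-\lVert\mathbf{x}_i-\mathbf{x}_j\rVert^2/(2\sigma^2))$, is our assumption; the paper does not spell out its convention):

```python
import numpy as np

# Minimal RBF kernel; substituting this for B = X @ X.T in the sketch above
# gives a kernel variant in the spirit of Figs. 2 and 3.
def rbf_kernel(X, sigma):
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # pairwise squared distances
    return np.exp(-D2 / (2.0 * sigma**2))
```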
5 Relations to existing work
There appear to be strong relations between spectral clustering [7] and our spectral relaxation of the problem. Additionally, the quality measure is strongly related to the K-means cost function [8, 6]. Finally, there seem to be interesting connections to (0-1) SDP problems used for solving combinatorial optimization problems such as clustering (e.g. K-means and graph cut clustering) [6, 1].
6 Experiments
We conducted 3 experiments, each time reporting the result of 6 iterations of the
alternative clustering scheme. Initial prior beliefs are always μ = 0 and Σ = I.
– A plain application to a synthetic dataset of 100 points and 2 dimensions in
4 clusters. See Fig. 1.
– An application to the same data but now with a Radial Basis Function
(RBF) kernel used for the inner products. See Fig. 2.
– An application with an RBF kernel to a different synthetic dataset, with
one central cluster and two half-moon shaped clusters around this central
cluster. See Fig. 3.
[Figure 1 consists of six scatter plots; the panel headings (values of $\Delta Q_k$) are 948.3664, 916.308, 923.4161, 88.7828, 55.4025, and 35.9984.]

Fig. 1. A synthetic dataset with 25 data points sampled from each of 4 two-dimensional Gaussian distributions with identity covariance matrix and different means. From left to right and top to bottom, the plots show the first 6 consecutive alternative clusters found by our method when a standard inner product is used (data points belonging to the cluster are plotted using crosses). The numbers above the plots are the values of $\Delta Q_k$ from Eq. (2) for the cluster shown. Note that it is high for the first 3 clusters, which reveal the enforced cluster structure, before dropping to a much lower level.
[Figure 2 consists of six scatter plots; the panel headings (values of $\Delta Q_k$) are 27.9164, 19.9145, 18.9836, 13.7828, 2.5518, and 1.8854.]

Fig. 2. A synthetic dataset with 25 data points sampled from each of 4 two-dimensional Gaussian distributions with identity covariance matrix and different means. From left to right and top to bottom, the plots show the first 6 consecutive alternative clusters when an RBF kernel with kernel width 3 is used. The numbers above the plots are the values of $\Delta Q_k$ from Eq. (2) for the cluster shown.
[Figure 3 consists of six scatter plots; the panel headings (values of $\Delta Q_k$) are 11.324, 10.2221, 10.192, 8.2983, 8.1599, and 6.9848.]

Fig. 3. A synthetic dataset with a central set of 20 data points surrounded by two half moons of 40 data points each. The plots show the first 6 consecutive alternative clusters when an RBF kernel with kernel width 0.3 is used. The numbers above the plots are the values of $\Delta Q_k$ from Eq. (2) for the cluster shown. Note that the second cluster generated contains all data points. This is possible and sensible from the perspective of our approach if the mean of the entire data set is significantly different from the expected mean in the initial background model. This may well be the case when working in a Hilbert space induced by the RBF kernel, where all data points lie in one orthant such that their mean cannot be at the origin.
7 Conclusions
In [3] we introduced a framework for data mining, aiming to quantify the subjective interestingness of patterns. We showed that Principal Component Analysis can be seen as implementing this framework for a particular pattern type and prior beliefs, thus providing an alternative justification for this method. In earlier work we also showed the potential of the framework in quantifying subjective interestingness for frequent itemset mining [2, 4, 5]. Now, in the present paper, we showed in detail how the framework can also be applied successfully to the case of clustering, leading to a new approach for alternative clustering that presents subjectively interesting clusters in data in an iterative data mining scheme.
In further work, we will investigate the quality of the spectral relaxation, and consider the development of tighter relaxations (e.g. to semi-definite programs). We will also further develop links with spectral clustering and other existing clustering approaches, to provide alternative justifications and insights or to improve on these approaches. We will also investigate the use of other pattern syntaxes for cluster(ing)s, and the use of more complex types of prior beliefs. Lastly, we plan to demonstrate the power of the framework by also applying these ideas to other types of data and corresponding types of prior beliefs, such as positive real-valued, integer, binary, and more structured types of data.
Due to space constraints, in this workshop paper we could not situate the
contributions within the wider literature on alternative clustering. We will rectify
this important shortcoming in a later version of this workshop paper.
Acknowledgements
This work is supported by the EPSRC grant EP/G056447/1.
References
1. T. De Bie and N. Cristianini. Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. Journal of Machine Learning Research, 7:1409–1436, 2006.
2. T. De Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 2010.
3. T. De Bie. An information-theoretic framework for data mining. In Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD11), 2011.
4. T. De Bie, K.-N. Kontonasios, and E. Spyropoulou. A framework for mining interesting pattern sets. SIGKDD Explorations, 2010.
5. K.-N. Kontonasios and T. De Bie. An information-theoretic approach to finding informative noisy tiles in binary databases. In Proceedings of the 2010 SIAM International Conference on Data Mining, 2010.
6. J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1), 2007.
7. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
8. H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for k-means clustering. In Advances in Neural Information Processing Systems 14 (NIPS01), pages 1057–1064, 2002.