<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subjectively Interesting Alternative Clusters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tijl De Bie</string-name>
          <email>tijl.debie@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Laboratory, University of Bristol</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>43</fpage>
      <lpage>54</lpage>
      <abstract>
        <p>We deploy a recently proposed framework for mining subjectively interesting patterns from data [3] to the problem of clustering, where patterns are clusters in the data. This framework outlines how the subjective interestingness of patterns (here, clusters) can be quantified using sound information-theoretic concepts. We demonstrate how it motivates a new objective function quantifying the interestingness of (a set of) clusters, automatically accounting for a user's prior beliefs and for redundancies between the clusters. Directly searching for the optimal set of clusters defined in this way is hard. However, the optimization problem can be solved to a provably good approximation if clusters are generated iteratively, paralleling the iterative data mining setting discussed in [3]. In this iterative scheme, each subsequent cluster is maximally interesting given the previously generated ones, automatically trading off interestingness with non-redundancy. Thus, this implementation of the clustering approach can be regarded as a method for alternative clustering. Although generating each cluster in an iterative fashion is computationally hard as well, we develop an approximation technique similar to spectral clustering algorithms. We end with a few visual demonstrations of the alternative clustering approach on artificial datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Subjective interestingness</kwd>
        <kwd>alternative clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A main challenge in research on clustering methods and theory is that
clustering is (in a way intentionally) ill-defined as a task. As a result, numerous
types of syntaxes for cluster patterns have been suggested (e.g. clusters as
hyperrectangles, hyperspheres, ellipsoids, decision trees; clusterings as partitions,
hierarchical partitionings, etc). Additionally, even when the syntax is fixed, there
are often various alternative choices for the objective function (e.g. the K-means
cost function, the likelihood of a mixture of Gaussians, etc).</p>
      <p>Despite this variety in approaches, the goal of clustering is almost always
to provide a user with insight into the structure of the data, allowing the user to
conceptualize it as coming from a number of broad areas in the data space. The
knowledge of such a structure can be more or less elucidating to the user, also
depending on the prior beliefs the user held about the data.</p>
      <p>
        Here, we take the perspective that a clustering is more useful if it conveys
more novel information. We make a specific choice for a cluster syntax, and we
deploy ideas from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to quantify the interestingness of a cluster as the amount
of information conveyed to the user when told about the cluster’s presence.
      </p>
      <p>Our approach attempts to quantify subjective interestingness of clusters, in
that it takes prior beliefs held by the user into account. As a result, different
clusters might be deemed interesting to different users. One particular example is
the situation where a user has already been informed about previously discovered
clusters in the data, which is the alternative clustering setting. In that case,
clusters that are individually informative while non-redundant will be the most
interesting ones. Our approach naturally deals with alternative clustering, by
regarding already communicated clusters as prior beliefs.</p>
      <p>Throughout this paper x ∈ Rd denotes a d-dimensional data point, and
X = x1 x2 · · · xn denotes the data matrix containing n data points as its
rows. The space the data matrix belongs to is denoted as X = Rn×d.
2</p>
    </sec>
    <sec id="sec-2">
      <title>A framework for data mining: a brief overview</title>
      <p>
        For completeness, we here provide a short overview of a framework for data
mining that was introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and readers familiar with this paper can skip
this section. Earlier and more limited versions of this framework, as well as its
application to frequent pattern mining, can be found in [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">4, 2, 5</xref>
        ]. For concreteness,
here we specialize the short overview of the framework to the case where the data
is a data set, summarized in the data matrix X.
      </p>
      <p>The framework aims to formalize data mining as a process of information
exchange between the data and the data miner (the user). The goal of the data
miner is to get as good an understanding about the data as possible, i.e. to
reduce his uncertainty as much as possible. To be able to do this, the degree of
uncertainty must be quantified, and to this end we use a probability distribution
P (referred to as the background distribution) to model the prior beliefs of the
user about the data X, in combination with ideas from information theory.</p>
      <p>More specifically, the framework deals with the setting where the prior beliefs
specify that the background distribution belongs to one of a set P of possible
distributions. The more prior beliefs, the smaller this set will be. For example,
the data miner may have a set of prior beliefs that can be formalized in the form
of constraints the background distribution P satisfies:</p>
      <p>∑_{X ∈ R^{n×d}} f_i(X) · P(X) = c_i,   ∀i.</p>
      <p>Such constraints represent the fact that the expected value of certain statistics
(the functions fi) are equal to a given number. The set P is defined as the set of
distributions satisfying these constraints. (Note that the framework is not limited
to such prior beliefs, although they are convenient from a practical viewpoint.)</p>
      <p>
        We argued in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that among all distributions P ∈ P , the ‘best’ choice for P
is the one of maximum entropy given these constraints. This background
distribution is the least biased one, thus not introducing any undue additional
assumptions into the background distribution. A further game-theoretic argument in favour of
using the distribution of maximum entropy is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In the framework, a pattern is defined as any piece of knowledge about the
data that reduces the set of possible values it may take from the original data
space X = R^{n×d} to a subset X′ ⊂ X. We then argued that the subjective
interestingness of such a pattern can be adequately formalized as the self-information
of the pattern, i.e. the negative logarithm of the probability that the pattern is
present in the data, i.e. by − log(P(X ∈ X′)). Thus, patterns are deemed more
interesting if their probability is smaller under the background model, and thus
if the user is more surprised by their observation.</p>
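<p>As a toy illustration of this quantity (our example, not from the paper), take a discrete stand-in for the data space: a uniform background distribution over 8 possible data values, and a pattern restricting the data to 2 of them.</p>

```python
import numpy as np

# Uniform background over 8 equally likely data values; a pattern restricting
# the data to 2 of them has probability 2/8, so its self-information is
# -log2(2/8) = 2 bits: rarer patterns are more surprising, hence more
# interesting under the framework.
p_pattern = 2 / 8
self_information_bits = -np.log2(p_pattern)
```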
      <p>
        After observing a pattern, a user instinctively adapts his beliefs. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we
argued that a natural and robust way to model this is by updating the background
distribution to a new distribution P′, defined as P conditioned on the pattern's
presence. The self-information of subsequent patterns can thus be evaluated by
referring to the new background distribution P′, and so on in an iterative fashion.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we showed that mining the most informative set of patterns formally
corresponds to a weighted set coverage problem, attempting to cover as many
elements as possible of the set X that have a small probability under the initial
background distribution P . This problem is NP-hard, but it can be approximated
well by a greedy approach, and the iterative data mining approach is equivalent
to such a greedy approximation.
      </p>
      <p>Thus, the alternative clustering method we will detail below generates
clusters in an iterative manner, in such a way that at any time the clusters generated
so far are approximately the most informative set of clusters of that size.</p>
    </sec>
    <sec id="sec-3">
      <title>Subjective interestingness of clusters</title>
      <sec id="sec-3-1">
        <title>Prior beliefs and the maximum entropy background distribution</title>
        <p>Here we consider two types of initial prior beliefs, expressed as constraints on
the first and second order cumulants of the data points. It is conceptually easy
to extend the results from this paper to other types of prior beliefs, although the
computational cost will vary. The background distribution incorporating these
constraints is the maximum entropy distribution that has the prescribed first
and second order cumulants. It is well known that for data with unbounded
domain, this distribution is the multivariate Gaussian distribution with mean
and covariance matrix equal to the prescribed cumulants:</p>
        <p>P(X) = (1/√((2π)^d |Σ|)^n) · exp( −(1/2) · trace( (X − eμ′)Σ^{-1}(X − eμ′)′ ) ),   (1)</p>
        <p>where e denotes the n-dimensional all-ones vector, so that eμ′ stacks n copies of μ′.</p>
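<p>The log-density of Eq. (1) is straightforward to evaluate numerically. The following numpy sketch (ours; the function name is illustrative, not from the paper) treats X as n i.i.d. Gaussian rows with the prescribed cumulants:</p>

```python
import numpy as np

def background_logpdf(X, mu, Sigma):
    # Log-density of the data matrix X under the background distribution of
    # Eq. (1): n i.i.d. rows, Gaussian with prescribed cumulants (mu, Sigma).
    n, d = X.shape
    R = X - mu                                   # rows of (X - e mu')
    sign, logdet = np.linalg.slogdet(Sigma)      # stable log-determinant
    quad = np.trace(R @ np.linalg.solve(Sigma, R.T))
    return -0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad)
```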
        <p>We note that the prescribed cumulants may be computed from the actual
data at the request of the data miner, such that they are indeed part of the
prior knowledge. However, they may also be beliefs, in the sense that they may
be based on external information or assumptions that may be right or wrong.</p>
      </sec>
      <sec id="sec-3-2">
        <title>A syntax for cluster patterns</title>
        <p>
          The framework from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] was developed for patterns generally defined as
properties of the data. Thus, a pattern’s presence in the data constrains the set of
possible values the data may have, and in this sense the knowledge of the presence
of a pattern reduces the uncertainty about the data and conveys information.
        </p>
        <p>In this paper we restrict our attention to one specific type of cluster pattern.
The pattern type we consider is parameterized by a set of indices to the data
points and a vector in the data space. The pattern is then the fact that the mean
of the data points for these indices is equal to the specified vector.</p>
        <p>More formally, let e_I ∈ R^n be defined as an indicator vector containing ones
at positions i ∈ I and zeros at positions i ∉ I, and let n_I = |I| = e_I′e_I denote
the number of elements in I. Then, our patterns are constraints of the form:</p>
        <p>(1/n_I) ∑_{i∈I} x_i = μ_I   ⇔   X′e_I/(e_I′e_I) = μ_I.</p>
        <p>Such a constraint restricts the possible values of the data set X, in that the
mean of a subset of the data points is required to have a prescribed value μ_I.</p>
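<p>As a quick numerical sanity check (illustrative code, not the authors'), the element-wise and matrix forms of this constraint coincide:</p>

```python
import numpy as np

# The cluster pattern can be written either as the mean of the selected points
# or as X' e_I / (e_I' e_I), with e_I the 0/1 indicator vector of the set I.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                      # n = 3 data points, d = 2
I = [0, 2]
e_I = np.zeros(3)
e_I[I] = 1.0                                    # indicator vector of I

mu_I_direct = X[I].mean(axis=0)                 # (1/n_I) sum_{i in I} x_i
mu_I_matrix = X.T @ e_I / (e_I @ e_I)           # X' e_I / (e_I' e_I)
```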
      </sec>
      <sec id="sec-3-3">
        <title>The self-information of a cluster pattern</title>
        <p>The following theorem shows how to assess the self-information of a pattern.</p>
        <p>Theorem 1. Given a background distribution of the form in Eq. (1), the
probability of a pattern of the form X′e_I/(e_I′e_I) = μ_I is given by:</p>
        <p>P( X′e_I/(e_I′e_I) = μ_I ) = (1/√((2π)^d |Σ|)) · exp( −(1/(2|I|)) · e_I′(X − eμ′)Σ^{-1}(X − eμ′)′e_I ).</p>
        <p>Thus the self-information of the pattern for a cluster specified by the set I,
defined as the negative log probability of the cluster pattern and denoted as
SelfInformationI , is equal to:</p>
        <p>SelfInformation_I = (1/2) log( (2π)^d |Σ| ) + (1/2) Q_I,
where Q_I = (1/|I|) · e_I′(X − eμ′)Σ^{-1}(X − eμ′)′e_I.</p>
        <p>Note that the self-information depends on I only through Q_I, so we may choose
to use Q_I as a quality metric for a cluster, as we will do in this paper.</p>
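<p>The quality metric Q_I can be computed in a few lines of numpy. The following is a sketch (ours; names are illustrative), assuming μ is given as a d-vector and Σ as a d × d matrix:</p>

```python
import numpy as np

def cluster_quality(X, I, mu, Sigma):
    # Q_I from Theorem 1: (1/|I|) e_I' (X - e mu') Sigma^{-1} (X - e mu')' e_I.
    # X is the n x d data matrix, I the index set of the cluster, and
    # (mu, Sigma) the prescribed first- and second-order cumulants.
    n = X.shape[0]
    e_I = np.zeros(n)
    e_I[list(I)] = 1.0                           # indicator vector of I
    s = e_I @ (X - mu)                           # e_I' (X - e mu'), a d-vector
    return (s @ np.linalg.solve(Sigma, s)) / len(I)
```

A cluster whose mean matches the background mean thus scores zero, while clusters far from the expected mean score high.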
        <p>This theorem can be used to quantify the self-information of any cluster
given the background model based on the prior beliefs of the data miner. Note
however that it cannot be used to assess the self-information of a cluster after
other clusters have already been found, as each new cluster will affect the user’s
prior beliefs. How this can be accounted for will be discussed in Sec. 3.4, based on
a generalization of Theorem 1. As Theorem 1 directly follows from Theorem 2,
we will only provide a proof for the latter in Sec. 3.4.</p>
      </sec>
      <sec id="sec-3-4">
        <title>The self-information of a set of cluster patterns</title>
        <p>Let us discuss the more general case, where the patterns are constraints of the
form X′E = M with E ∈ R^{n×k} and M ∈ R^{d×k}. Clearly, the type of patterns
we wish to consider are a special case of this, with E = e_I/(e_I′e_I) and M = μ_I.
Furthermore, it allows us to consider a composite pattern, a pattern defined as
the union of a set of k patterns. Indeed, if we have k different cluster patterns
specified by the sets from I = {I_i}, we can write down this set of constraints
concisely as X′E = M, where E and M contain e_{I_i}/(e_{I_i}′e_{I_i}) and the
mean vector μ_i of the i'th cluster as their i'th columns.</p>
        <p>Theorem 2. Let the columns of the matrix E be the indicator vectors of the sets
in I = {I_i}, and let P_E = E(E′E)^{-1}E′, the projection matrix onto the column
space of E. Then, the probability of the composite pattern X′E = M is given by:</p>
        <p>P(X′E = M) = (1/√((2π)^d |Σ|)^k) · exp( −(1/2) trace( P_E (X − eμ′)Σ^{-1}(X − eμ′)′ ) ).</p>
        <p>Thus the self-information of the set of patterns defined by the columns of E,
defined as its negative log probability and denoted as SelfInformationI , is equal
to:</p>
        <p>SelfInformation_I = (k/2) log( (2π)^d |Σ| ) + (1/2) Q_I,
where Q_I = trace( P_E (X − eμ′)Σ^{-1}(X − eμ′)′ ).</p>
        <p>Again, since the self-information depends on I only through Q_I, we choose to
use Q_I as a quality metric for a set of clusters further below.</p>
        <p>Proof. A constraint X′E = M constrains the data X to an (n − k) × d
dimensional affine subspace in the following way. Let us write the singular value
decomposition for E as:</p>
        <p>E = ( U U_0 ) ( Λ ; 0 ) V′ = U Λ V′,</p>
        <p>where the columns of U ∈ R^{n×k} and U_0 ∈ R^{n×(n−k)} form orthonormal bases
for the column space of E and for its orthogonal complement, respectively. Writing
X = U Z + U_0 Z_0, the constraint X′E = M is equivalent to Z = Λ^{-1}V′M′, a
constant fixed by E and M, while Z_0 ∈ R^{(n−k)×d} remains a free variable. In
general, we can write the probability density for X as:</p>
        <p>P(X) = P(Z, Z_0)
= (1/√((2π)^d |Σ|)^n) · exp( −(1/2) trace( (U Z + U_0 Z_0 − eμ′)Σ^{-1}(U Z + U_0 Z_0 − eμ′)′ ) )
= (1/√((2π)^d |Σ|)^n) · exp( −(1/2) trace( (Z − U′eμ′)Σ^{-1}(Z − U′eμ′)′ ) )
· exp( −(1/2) trace( (Z_0 − U_0′eμ′)Σ^{-1}(Z_0 − U_0′eμ′)′ ) ).</p>
        <p>We can now compute the marginal probability density for Z by integrating over
Z_0, yielding:</p>
        <p>P(Z) = (1/√((2π)^d |Σ|)^k) · exp( −(1/2) trace( (Z − U′eμ′)Σ^{-1}(Z − U′eμ′)′ ) ).</p>
        <p>The probability density value for the pattern's presence, i.e. for X′E = M or
equivalently Z = Λ^{-1}V′M′, is thus:</p>
        <p>P(Z = Λ^{-1}V′M′)
= (1/√((2π)^d |Σ|)^k) · exp( −(1/2) trace( (Λ^{-1}V′M′ − U′eμ′)Σ^{-1}(Λ^{-1}V′M′ − U′eμ′)′ ) )
= (1/√((2π)^d |Σ|)^k) · exp( −(1/2) trace( P_E (X − eμ′)Σ^{-1}(X − eμ′)′ ) ),</p>
        <p>where P_E = E(E′E)^{-1}E′ = U U′ is a projection matrix projecting onto the
k-dimensional column space of E.</p>
        <p>Note that Theorem 1 is indeed a special case of Theorem 2, as can be seen
by substituting E = e_I and P_E = e_I e_I′/|I|.</p>
        <p>According to the framework, a (composite) pattern specified by matrices E
and M in this way is thus more informative if trace( P_E (X − eμ′)Σ^{-1}(X − eμ′)′ )
is larger. Thus, we could search for the most informative set of clusters by
maximizing this quality measure with respect to a set I = {I_i} of clusters. This is a
hard problem though, even in the case where only one cluster is sought. Thus,
we developed an approximation algorithm.</p>
        <p>
          Our approach is approximate in two ways. First, the search for a set of clusters
is approximated using a greedy algorithm, searching clusters one by one, thus
operating like an alternative clustering algorithm. From [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], it can be seen that this
greedy approximation is guaranteed to approximate the true optimum provably
well. Second, the search for each cluster is relaxed to an eigenvalue problem.
These issues are discussed in greater detail in Sec. 4.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>The cost of describing a cluster</title>
        <p>
          The framework from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] suggests to take into account not only the
self-information of a pattern, but also the cost to communicate a pattern, i.e. its description
length. This depends on the coding scheme used, which should reflect the
perceived complexity of a pattern as perceived by the data miner. Choosing this
coding scheme can also be done so as to bias the results toward specific patterns.
        </p>
        <p>In the current context, describing a pattern amounts to describing the subset
I and the mean vector μI . For simplicity, we assume the description length is
constant for all patterns, independent of I and μI . However, note that different
costs could be used if patterns with smaller sets I are more easy to understand
(i.e. have a smaller cost), or vice versa.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Alternative clustering: finding the next most informative cluster</title>
      <p>Here we discuss an iterative approach to optimizing the quality measure QI from
Theorem 2. There are two reasons for choosing an iterative approach.</p>
      <p>
        Firstly, directly optimizing the quality measure is equivalent to a set covering
type problem (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
] for more background on why this is the case). While
NP-hard, this problem can be approximated well by optimizing over the different
clusters (and thus the columns of E) in a greedy iterative manner.
      </p>
      <p>Secondly, usually it is not a priori clear how many clusters are required for the
data miner to be sufficiently satisfied with his new understanding of the data. The
idea of alternative clustering, as we view it, is to provide the user the opportunity
to request new clusters (or clusterings) as long as more are desired. Optimizing
the quality measure over a growing set of clusters by iteratively optimizing over
a newly added column of E is thus a type of alternative clustering.</p>
      <p>Hence, the iterative approach can be regarded as an approximation, but one
with usability benefits over a globally optimizing approach.</p>
      <sec id="sec-4-1">
        <title>The iterative scheme: alternative clustering</title>
        <p>To search for the first cluster, we attempt to optimize the quality function from
Theorem 1. This is itself a hard problem, but we explain how we approximately
solve it in Sec. 4.2.</p>
        <p>In the subsequent iterations, let us say that we have already found k − 1 ≥ 1
clusters, and the matrices E and M respectively contain the normalized indicator
vectors and cluster means as their columns. We are interested in finding the k’th
cluster so as to optimize the quality measure from Theorem 2 but keeping the
first k − 1 cluster patterns as they are.</p>
        <p>To do this, it is convenient to write the quality measure as a function of the
k'th cluster with indicator vector e_k. Let Q_E = I − P_E, the projection matrix
onto the orthogonal complement of the column space of E. Furthermore, let us
denote E∗ = (E e_k). Then, using the definition of a projection matrix and the
matrix inversion lemma:</p>
        <p>Q_{I_i | i=1:k} = trace( P_{E∗} (X − eμ′)Σ^{-1}(X − eμ′)′ )
= trace( P_E (X − eμ′)Σ^{-1}(X − eμ′)′ ) + trace( (Q_E e_k e_k′ Q_E)/(e_k′ Q_E e_k) · (X − eμ′)Σ^{-1}(X − eμ′)′ )
= Q_{I_i | i=1:k−1} + (e_k′ Q_E (X − eμ′)Σ^{-1}(X − eμ′)′ Q_E e_k)/(e_k′ Q_E e_k).</p>
        <p>Note that if we define Q_∅ = 0 and take Q_E = I, the quality measure from
Theorem 1 is retrieved. We can thus interpret the above reformulation of the
quality metric for the k'th cluster conditioned on the first k − 1 clusters as being
the quality metric for a first cluster on data that is projected onto the space
orthogonal to the k − 1 columns of E, i.e. the k − 1 previously selected indicator
vectors. It is as if the data was deflated to take account of the knowledge of the
previously found cluster patterns, thus automatically accounting for redundancy.</p>
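<p>The decomposition above can be checked numerically. The following numpy sketch (ours, not the authors' code) verifies on a small fixed example that Q_{1:k} equals Q_{1:k−1} plus the Rayleigh-quotient increment:</p>

```python
import numpy as np

def Q_of(E, K):
    # Quality measure trace(P_E K), with P_E the projection onto col(E).
    P_E = E @ np.linalg.solve(E.T @ E, E.T)
    return np.trace(P_E @ K)

# Fixed toy "data": n = 6 points in d = 2 dimensions, with mu = 0 and
# Sigma = I, so that K = (X - e mu') Sigma^{-1} (X - e mu')' = X X'.
X = np.arange(12, dtype=float).reshape(6, 2)
K = X @ X.T

E = np.zeros((6, 1)); E[[0, 1], 0] = 1.0        # previously found cluster {0, 1}
e_k = np.zeros(6);    e_k[[2, 3]] = 1.0         # candidate k'th cluster {2, 3}

Q_E = np.eye(6) - E @ np.linalg.solve(E.T @ E, E.T)
q = Q_E @ e_k
lhs = Q_of(np.column_stack([E, e_k]), K)        # Q_{1:k}
rhs = Q_of(E, K) + (q @ K @ q) / (e_k @ q)      # Q_{1:k-1} + Delta Q_k
```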
      </sec>
      <sec id="sec-4-2">
        <title>A spectral relaxation of the iterations</title>
        <p>Each of the iterative steps thus reduces to the maximization of the following
increase of the quality measure:</p>
        <p>ΔQ_k = (e_k′ Q_E (X − eμ′)Σ^{-1}(X − eμ′)′ Q_E e_k)/(e_k′ Q_E e_k).   (2)</p>
        <p>If we relax the vector e_k to be real-valued instead of containing only 0's and 1's,
this Rayleigh quotient is maximized by the dominant eigenvector of the matrix
Q_E (X − eμ′)Σ^{-1}(X − eμ′)′ Q_E. Thus, as an approximation technique we will
use this dominant eigenvector, and threshold it to obtain a crisp 0/1 vector
e_k. To determine a suitable threshold, we simply do an exhaustive search over the
n + 1 threshold values that generate a different set I of indices i ∈ I for which
e_k(i) = 1, selecting the threshold that maximizes the quantity in Eq. (2).</p>
        <p>Note that for Σ = I, the quality metrics depend on X only through the inner
product matrix XX′. This means that a kernel variant is readily derived, by
substituting this inner product matrix with any suitable kernel matrix. In this
way nonlinearly shaped clusters can be obtained, similar to spectral clustering
methods and kernel K-Means.</p>
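<p>Putting the pieces together, one iterative step can be sketched as follows (our illustrative implementation, not the authors' code; names are ours): form Q_E, take the dominant eigenvector of Q_E · K · Q_E, and exhaustively search the thresholds of Eq. (2):</p>

```python
import numpy as np

def next_cluster(K, E):
    # One (relaxed) iteration: find the next cluster given the indicator
    # vectors of the previously found clusters, stored as the columns of E.
    # K is the n x n matrix (X - e mu') Sigma^{-1} (X - e mu')', or any
    # suitable kernel matrix for the kernel variant.
    n = K.shape[0]
    if E.shape[1]:
        Q_E = np.eye(n) - E @ np.linalg.solve(E.T @ E, E.T)
    else:
        Q_E = np.eye(n)                      # no previous clusters yet
    S = Q_E @ K @ Q_E
    S = (S + S.T) / 2.0                      # symmetrize against round-off
    v = np.linalg.eigh(S)[1][:, -1]          # dominant eigenvector, relaxed e_k
    # Exhaustive threshold search: every threshold on v yields a 0/1 vector.
    order = np.argsort(-v)
    best_I, best_dq = None, -np.inf
    for m in range(1, n + 1):
        e = np.zeros(n)
        e[order[:m]] = 1.0
        q = Q_E @ e
        denom = e @ q                        # e_k' Q_E e_k
        if denom > 1e-12:
            dq = (q @ K @ q) / denom         # Delta Q_k of Eq. (2)
            if dq > best_dq:
                best_dq = dq
                best_I = set(int(i) for i in order[:m])
    return best_I, best_dq
```

Calling this repeatedly, appending each found indicator vector as a new column of E, yields the alternative clustering scheme described above.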
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Relations to existing work</title>
      <p>
        There appear to be strong relations between spectral clustering and our
spectral relaxation of the problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Additionally, the quality measure is strongly
related to the K-Means cost function [
        <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
        ]. Finally, there seem to be interesting
connections to (0-1) SDP problems used for solving combinatorial optimization
problems such as clustering (e.g. K-Means and graph cut clustering) [
        <xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>We conducted 3 experiments, each time reporting the result of 6 iterations of the
alternative clustering scheme. Initial prior beliefs are always μ = 0 and Σ = I.
– A plain application to a synthetic dataset of 100 points and 2 dimensions in
4 clusters. See Fig. 1.
– An application to the same data but now with a Radial Basis Function
(RBF) kernel used for the inner products. See Fig. 2.
– An application with an RBF kernel to a different synthetic dataset, with
one central cluster and two half-moon shaped clusters around this central
cluster. See Fig. 3.</p>
      <p>Fig. 3. A synthetic dataset with a central set of 20 data points surrounded by two half
moons of 40 data points each. The plots show the first 6 consecutive alternative clusters
when an RBF kernel with kernel width 0.3 is used. The numbers above the plots are
the values of ΔQk from Eq. (2) for the cluster shown. Note that the second cluster
generated contains all data points. This is possible and sensible from the perspective
of our approach if the mean of the entire data set is significantly different from the
expected mean in the initial background model. This may well be the case when working
in a Hilbert space induced by the RBF kernel, where all data points lie in one orthant
such that their mean cannot be in the origin.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] we introduced a framework for data mining, aiming to quantify the
subjective interestingness of patterns. We showed that Principal Component Analysis
can be seen as implementing this framework for a particular pattern type and
prior beliefs, thus providing an alternative justification for this method. In earlier
work we also showed the potential of the framework in quantifying subjective
interestingness for frequent itemset mining [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">2, 4, 5</xref>
        ]. Now, in the present paper, we
showed in detail how the framework can also be applied successfully to the case
of clustering, leading to a new approach for alternative clustering that presents
subjectively interesting clusters in data in an iterative data mining scheme.
      </p>
      <p>In further work, we will investigate the quality of the spectral relaxation, and
consider the development of tighter relaxations (e.g. to semi-definite programs).
We will also further develop links with spectral clustering and other existing
clustering approaches, to provide alternative justifications and insights or to
improve on these approaches. We will also investigate the use of other pattern
syntaxes for cluster(ing)s, and the use of more complex types of prior beliefs.
Lastly, we plan to demonstrate the power of the framework by also applying
these ideas to other types of data and corresponding types of prior beliefs, such
as positive real-valued, integer, binary, and more structured types of data.</p>
      <p>Due to space constraints, in this workshop paper we could not situate the
contributions within the wider literature on alternative clustering. We will rectify
this important shortcoming in a later version of this workshop paper.</p>
      <sec id="sec-7-1">
        <title>Acknowledgements</title>
        <p>This work is supported by the EPSRC grant EP/G056447/1.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>T. De Bie</surname>
            and
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Cristianini</surname>
          </string-name>
          .
          <article-title>Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :
          <fpage>1409</fpage>
          -
          <lpage>1436</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. T. De Bie.
          <article-title>Maximum entropy models and subjective interestingness: an application to tiles in binary databases</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>T. De Bie</surname>
          </string-name>
          .
          <article-title>An information-theoretic framework for data mining</article-title>
          .
          <source>In Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD11)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>T. De Bie</surname>
            ,
            <given-names>K.-N.</given-names>
          </string-name>
          <string-name>
            <surname>Kontonasios</surname>
            , and
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Spyropoulou</surname>
          </string-name>
          .
          <article-title>A framework for mining interesting pattern sets</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>K.-N.</given-names>
            <surname>Kontonasios</surname>
          </string-name>
          and
          <string-name>
            <surname>T. De Bie</surname>
          </string-name>
          .
          <article-title>An information-theoretic approach to finding informative noisy tiles in binary databases</article-title>
          .
          <source>In Proceedings of the 2010 SIAM International Conference on Data Mining</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Jiming</given-names>
            <surname>Peng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wei</surname>
          </string-name>
          .
          <article-title>Approximating k-means-type clustering via semidefinite programming</article-title>
          .
          <source>SIAM Journal on Optimization</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          .
          <article-title>Normalized cuts and image segmentation</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>22</volume>
          (
          <issue>8</issue>
          ):
          <fpage>888</fpage>
          -
          <lpage>905</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Simon</surname>
          </string-name>
          .
          <article-title>Spectral relaxation for k-means clustering</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>14</volume>
          (
          <issue>NIPS01</issue>
          ), pages
          <fpage>1057</fpage>
          -
          <lpage>1064</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>