<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fast Multidimensional Clustering of Categorical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tengfei Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nevin L. Zhang</string-name>
          <email>lzhang@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kin Man Poon</string-name>
          <email>lkmpoon@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Wang</string-name>
          <email>wangy@comp.nus.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hua Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science National University of</institution>
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Engineering The Hong Kong University of Science and Technology</institution>
        </aff>
      </contrib-group>
      <fpage>19</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Early research work on clustering usually assumed that there was one true clustering of data. However, complex data are typically multifaceted and can be meaningfully clustered in many different ways. There is a growing interest in methods that produce multiple partitions of data. One such method is based on latent tree models (LTMs). This method has a number of advantages over alternative methods, but is computationally inefficient. We propose a fast algorithm for learning LTMs and show that the algorithm can produce rich and meaningful clustering results in moderately large data sets.</p>
      </abstract>
      <kwd-group>
        <kwd>Multidimensional Clustering</kwd>
        <kwd>Model-based Clustering</kwd>
        <kwd>Latent Tree Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>There are several clustering methods that produce multiple partitions. We refer
to them as multi-partition clustering (MPC) methods. MPC methods,
according to the way that partitions are found, can be divided into two categories:
sequential MPC methods and simultaneous MPC methods.</p>
      <p>
        Sequential MPC methods produce multiple clustering solutions sequentially.
One class of such methods is known as alternative clustering [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
        ]. It aims to
discover a new clustering that is different from a previously known clustering. The
key issue is how to ensure the novelty of the new clustering. One can repeatedly
apply such methods to produce a sequence of clusterings.
      </p>
      <p>
        Simultaneous MPC methods, on the other hand, produce multiple
clustering solutions simultaneously. Both distance-based and model-based methods
have been proposed for simultaneous MPC. The distance-based methods
proposed by [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] require as inputs the number of partitions and the number of
clusters in each partition. They try to optimize the quality of each individual
partition while keeping different partitions as dissimilar as possible. Model-based
methods fit data with a probabilistic model that contains multiple latent
variables. Each latent variable represents a soft partition and can be viewed as a
hidden dimension of data. So we also refer to model-based simultaneous MPC
as multidimensional clustering (MDC). Unlike distance-based methods,
model-based methods can automatically determine the number of partitions and the
number of clusters in each partition based on statistical principles.
      </p>
      <p>
        Among the MDC methods, Galimberti et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Guan et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] use
what we call disjoint and independent view (DIV) models. In a DIV model, each
latent variable is associated with a subset of attributes, which gives one view of
data. The subsets for different latent variables are disjoint. A latent variable is
independent of all the other latent variables and all the attributes that are not
associated with it. On the other hand, Zhang [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Poon et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use latent
tree models (LTMs). An LTM can be viewed as a DIV model with the latent
variables connected to form a tree structure.
      </p>
      <p>This paper is concerned with the use of LTMs for producing multiple
clustering solutions. Currently, there is a lack of efficient algorithms for learning
LTMs. The fastest algorithm takes weeks to process data sets with around 100
attributes and a few thousand samples. We propose a more efficient algorithm
that can analyze data with hundreds of attributes in a few hours. We also provide
empirical results to show that the algorithm can produce much richer clustering
results than alternative methods.</p>
    </sec>
    <sec id="sec-2">
      <title>Latent Tree Models</title>
      <p>Technically, an LTM is a Markov random field over an undirected graph, where
variables at leaf nodes are observed and variables at internal nodes are hidden.
An example of an LTM is shown in Figure 1. It is part of the latent tree model
learned from the WebKB data, which will be described in more detail in Section
4.2. The Y-variables are the latent variables. The numbers in parentheses are
the numbers of states of the latent variables. The leaf nodes here are different
words, which take binary values indicating the presence or absence of the word. The
edges represent probabilistic dependence, and their width represents dependence
strength.</p>
      <p>For technical convenience, we often root an LTM at one of its latent nodes and
regard it as a directed graphical model, i.e., a Bayesian network. For the model in
Figure 1, suppose we use Y43 as the root. Then the edges are directed as pointing
away from Y43. For example, the edges Y43 − Y53 and Y53 − ut should be directed
as Y43 → Y53 and Y53 → ut. The numerical information of the model includes
a marginal distribution P (Y43) for the root and one conditional distribution
for each edge. For the edge Y43 → Y53, we have distribution P (Y53|Y43); for
the edge Y53 → ut, we have distribution P (ut|Y53); and so on. The product of
those distributions defines a joint distribution over all the latent and observed
variables. In this paper, we assume all variables are discrete.
</p>
      <sec id="sec-2-1">
        <title>The State of the Art</title>
        <p>
          Suppose there is a data set D. The attributes of the data set are the observed
variables. To learn an LTM from D, one needs to determine: (1) the number
of latent variables, (2) the number of states of each latent variable, (3) the
connections among the latent and observed variables, and (4) the probability
parameters. We use m to denote the information for the first three items and θ
to denote the collection of parameter values. We aim at finding the pair (m, θ∗)
where θ∗ is the maximum likelihood estimate of the parameters and m maximizes
the BIC score [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]:
BIC(m | D) = log P(D | m, θ∗) − (d(m)/2) log N,
where d(m) is the number of free parameters in m and N is the sample size.
        </p>
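        <p>Once the maximized log-likelihood is available, the BIC score above is straightforward to compute. A minimal helper (function and argument names are ours, not the paper's):</p>
```python
import math

def bic_score(loglik, d_m, n):
    """BIC(m | D) = log P(D | m, theta*) - (d(m) / 2) * log N,
    where d_m is the number of free parameters of m and n is the sample size."""
    return loglik - 0.5 * d_m * math.log(n)
```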
        <p>
Several algorithms for learning LTMs have been proposed. The
state-of-the-art is an algorithm called EAST [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is a search-based method and is capable
of producing good models for data sets with dozens of attributes. However, it is
not efficient enough for data sets with more than 100 attributes. There are two
other algorithms that are more efficient than EAST, but they focus on special
LTMs. Harmeling and Williams [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] consider only binary trees, while Choi et al.
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] assume all the variables share the same domain. Neither method is intended
for cluster analysis.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Bridged Islands Algorithm</title>
      <p>We now set out to present a new algorithm that is drastically more efficient than
EAST. In an LTM, the set of attributes that are connected to the same latent
variable is called a sibling cluster. Attributes in the cluster are said to be siblings.
In the LTM shown in Figure 1, attributes austin, utexas and ut form one sibling
cluster because they are all connected to Y53.</p>
      <p>The new algorithm proceeds in five steps:
1. Partition the set of attributes into sibling clusters;
2. For each sibling cluster, introduce a latent variable and determine the number
of states of this variable;
3. Determine the connections among the latent variables so that they form a
tree;
4. Determine the values of the probability parameters;
5. Refine the model.</p>
      <p>If we imagine the sibling clusters formed in Step 1, together with the latent variables
added in Step 2, as islands in an ocean, then the islands are connected in Step
3. So we call the algorithm the bridged islands (BI) algorithm. In the following,
we describe each step of BI in detail.</p>
      <p>Step 1: BI determines sibling clusters using two intuitions. First, attributes
from the same sibling cluster tend to be more closely correlated than those from
different sibling clusters. Second, if two attributes are siblings in the optimal
model for one set of attributes, they should also be siblings in the optimal model
for a subset of the attributes.</p>
      <p>BI determines the first sibling cluster as follows. There is a working subset
of attributes. Initially, it contains the pair of attributes with the highest mutual
information (MI). Here MI is computed from the empirical distribution of the
data. BI grows the working subset by adding other attributes into it one by one.
At each step, it chooses the attribute that has the highest MI with the current
subset. (The first intuition is used here.) The MI between a variable X and a
set S is estimated as follows:</p>
      <p>I(X; S) = max_{Z∈S} I(X; Z) = max_{Z∈S} Σ_{X,Z} P(X, Z) log [P(X, Z) / (P(X)P(Z))]   (1)</p>
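      <p>Equation (1) and the underlying empirical MI can be sketched as follows. This is a simplified illustration (names are ours); `data` is a list of tuples of attribute values:</p>
```python
from collections import Counter
from math import log

def empirical_mi(data, i, j):
    """MI between attributes i and j, computed from the empirical
    distribution of the data (rows are tuples of attribute values)."""
    n = len(data)
    pxz = Counter((row[i], row[j]) for row in data)
    px = Counter(row[i] for row in data)
    pz = Counter(row[j] for row in data)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (pz[z] / n)))
               for (x, z), c in pxz.items())

def mi_with_subset(data, x, subset):
    """Equation (1): I(X; S) estimated as the maximum of I(X; Z) over Z in S."""
    return max(empirical_mi(data, x, z) for z in subset)
```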
      <p>BI determines when to stop expanding the working subset using the
unidimensionality test, or simply the UD-test. This is the most important idea of this
paper. A subset of attributes is unidimensional if the optimal LTM (i.e., the one
with the highest BIC score) for those attributes contains only one latent node.
The UD-test determines whether a subset of attributes is unidimensional by first
projecting the data onto those attributes and then running EAST on the projected
data. If the resulting model contains only one latent variable, then the UD-test
concludes that the subset is unidimensional. Otherwise, it concludes the opposite.
For computational efficiency, EAST is allowed to examine only models with 1 or
2 latent variables.</p>
      <p>After each attribute is added to the working subset, BI runs the UD-test
on the subset. If the test fails, the attributes in the subset cannot all be in
one sibling cluster in the final model, according to the second intuition. So, BI
stops growing the subset and picks, from the local model learned during the
UD-test, the sibling cluster that contains (one of or both) the two initial attributes
as the first sibling cluster for the whole algorithm. Attributes in the cluster are
then removed from the data set, and the process repeats to find other sibling
clusters. Figure 2 illustrates the process. The working subset initially contains
X1 and X2. Then X3 is added. EAST is run to find an LTM for the expanded
subset {X1, X2, X3}. The resulting model, shown in (a), contains only one latent
variable. So the triplet passes the UD-test. Then X4 is added and the UD-test
is again passed, as shown in (b). After that, X5 is added. This time EAST yields
a model, shown in (c), with more than one latent variable. So the UD-test fails
and BI stops growing the subset. The sibling cluster {X1, X2, X4} of the local
model is picked as the first sibling cluster for the whole algorithm.</p>
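      <p>The growth loop of Step 1 can be sketched as follows. Here `unidimensional` and `seed_cluster` are hypothetical stand-ins for the EAST-based UD-test and for extracting the seed-containing sibling cluster from the local model; only the loop structure follows the description above.</p>
```python
def find_sibling_cluster(attrs, pair_mi, unidimensional, seed_cluster):
    """Sketch of Step 1 of BI. `pair_mi(a, b)` gives the empirical MI of two
    attributes; `unidimensional(subset)` stands in for the EAST-based UD-test;
    `seed_cluster(subset, seeds)` stands in for picking, from the local model,
    the sibling cluster that contains the two seed attributes."""
    # start from the pair of attributes with the highest mutual information
    seeds = max(((a, b) for a in attrs for b in attrs if a < b),
                key=lambda p: pair_mi(*p))
    working = list(seeds)
    rest = [a for a in attrs if a not in working]
    while rest:
        # add the attribute with the highest MI with the current subset (eq. 1)
        nxt = max(rest, key=lambda x: max(pair_mi(x, w) for w in working))
        working.append(nxt)
        rest.remove(nxt)
        if not unidimensional(working):
            # UD-test failed: stop growing and return the seed-containing cluster
            return seed_cluster(working, seeds)
    return working
```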
      <p>
        Step 2: A latent class model (LCM) is an LTM with only one latent node. Figure
2 (a) and (b) show two examples. It is a commonly used finite mixture model
for discrete data. At Step 2, BI learns an LCM for each sibling cluster. There
are two subtasks: (1) to determine the cardinality of the latent variable and
(2) to optimize the probability parameters. BI starts by setting the cardinality
of the latent variable to 2 and optimizing the model parameters by running the
EM algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Then it considers repeatedly increasing the cardinality. After
each increase, model parameters are re-optimized. The process stops when the
BIC score ceases to increase.
      </p>
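      <p>Step 2 can be illustrated with a self-contained sketch: an EM routine for an LCM over binary attributes, and a loop that increases the cardinality while the BIC score keeps increasing. The fixed iteration count, the smoothing constants, and the absence of random restarts are our simplifications, not the paper's settings.</p>
```python
import math, random

def fit_lcm_em(data, k, iters=50, seed=0):
    """EM for a latent class model: one latent variable with k states and
    binary attributes. Returns the final log-likelihood (a sketch)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    prior = [1.0 / k] * k
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    loglik = float("-inf")
    for _ in range(iters):
        # E-step: posterior over latent states for each data case
        resp, loglik = [], 0.0
        for row in data:
            joint = []
            for s in range(k):
                p = prior[s]
                for j, v in enumerate(row):
                    p *= theta[s][j] if v else (1.0 - theta[s][j])
                joint.append(p)
            tot = sum(joint)
            loglik += math.log(tot)
            resp.append([p / tot for p in joint])
        # M-step: re-estimate the prior and per-state Bernoulli parameters
        for s in range(k):
            ns = sum(r[s] for r in resp)
            prior[s] = ns / n
            for j in range(d):
                num = sum(r[s] for r, row in zip(resp, data) if row[j])
                theta[s][j] = (num + 1e-6) / (ns + 2e-6)  # light smoothing
    return loglik

def free_params(k, d):
    # (k - 1) for the latent prior, plus k * d Bernoulli parameters
    return (k - 1) + k * d

def choose_cardinality(data, max_k=5):
    """Grow the cardinality while the BIC score keeps increasing."""
    n, d = len(data), len(data[0])
    best_k, best_bic = None, float("-inf")
    for k in range(2, max_k + 1):
        bic = fit_lcm_em(data, k) - 0.5 * free_params(k, d) * math.log(n)
        if bic <= best_bic:
            break
        best_k, best_bic = k, bic
    return best_k
```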
      <p>Step 3: After the first 2 steps, BI has obtained a collection of LCMs. We can
visualize those LCMs as islands in an ocean. The next step is to link up the
islands in a tree formation by adding edges between the latent variables.</p>
      <p>
        Chow and Liu [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] give a well-known algorithm for learning tree-structured
models among observed variables. It first estimates the MI between each pair
of variables from data, then constructs a complete undirected graph with the
MI values as edge weights, and finally finds the maximum spanning tree of the
graph. The resulting tree model has the maximum likelihood among all tree
models. Chow-Liu’s algorithm can be adapted to link up the latent variables of
the aforementioned LCMs. We only need to specify how the MI between two
latent variables is to be estimated. Let m1 and m2 be two LCMs with latent
variables Y1 and Y2, respectively (here m1 and m2 include their parameter
values). Enumerate the data cases as d1, d2, . . . , dN. Let di1 and
di2 (i ∈ {1, 2, . . . , N}) be respectively the projections of di onto the attributes
of m1 and m2. Define the data-conditional joint distribution of Y1 and Y2 as
follows:
      </p>
      <p>P(Y1, Y2 | D, m1, m2) = C Π_{i=1}^{N} P(Y1 | m1, di1) P(Y2 | m2, di2),   (2)
where C is the normalization constant. We estimate the MI between Y1 and Y2
from this joint distribution.</p>
      <p>Step 4: In this step, BI optimizes the probability parameters of the LTM resulting
from Step 3 using EM.</p>
      <p>Step 5: The sibling clusters and the cardinalities of the latent variables were
determined in Steps 1 and 2. Each of those decisions was made in the context of a
small number of attributes. Now that all the variables are connected in a global
model, it is time to re-examine the decisions and to see whether adjustments
should be made.</p>
      <p>In Step 5, BI checks each attribute to see whether it should be relocated and
each latent variable to see if its cardinality should be changed. All the potential
adjustments are evaluated with respect to the model, denoted by mˆ, resulting
from the previous step. The beneficial adjustments are executed in one batch
after all the evaluations. Adjustment evaluations and adjustment executions are
not interleaved, because interleaving would require parameter optimization after
each adjustment and hence be computationally expensive.</p>
      <p>For each attribute X and each latent variable Y, BI computes their
data-conditional MI I(X; Y | D, mˆ) from the following distribution:
P(X, Y | D, mˆ) = C Π_{i=1}^{N} P(X, Y | mˆ, di),   (3)
where C is the normalization constant. Let Yˆ be the latent variable that has the
highest data-conditional MI with X. If Yˆ is not the current parent node of X in
mˆ, then it is deemed beneficial to relocate X from its parent node to Yˆ.</p>
      <p>To determine whether a change in the cardinality of a latent variable is
beneficial, BI freezes all the parameters that are not affected by the change, runs
EM locally to optimize the parameters affected by the change, and recalculates
the BIC score. The change is deemed beneficial if the BIC is increased. BI starts
from the current cardinality of each latent variable and considers increasing it
by one. If it is beneficial to do so, further increases are considered.</p>
      <p>After model renfiement, EM is run on the global model one more time to
optimize the parameters.</p>
      <p>Computational Efficiency: BI is much faster than EAST. The reason is that
most steps of BI involve only a small number of variables and the EM algorithm
is run on the global model only twice. We tested BI and EAST on two data sets
with 81/108 attributes and 3021/2763 samples respectively. EAST took 6/24
days, while BI took 35/69 minutes respectively. A detailed running time analysis
will be given in a longer version of the paper.</p>
    </sec>
    <sec id="sec-4">
      <title>Empirical Results with BI</title>
      <p>We now present empirical results to demonstrate BI’s capability in discovering
rich and meaningful multiple clusterings. Comparisons with other methods will
be made in the next section.</p>
      <sec id="sec-4-1">
        <title>Clustering Synthetic Data</title>
        <p>
          As a test of concept, we first tried BI on synthetic data. The data set was
sampled from an LTM with the structure shown in Figure 3(left). There are 3
latent variables and 15 attributes, and all variables are binary. A total of
1000 data cases were sampled. Each data case contains values for the attributes
X1-X15, but not for the latent variables Y1-Y3. The data set has 3 ‘true’ partitions,
each given by a latent variable. To provide some intuition on what the partitions
are about, Figure 3(right) depicts the normalized mutual information (NMI) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
between each partition and each attribute. The x-axis represents the observed
variables (i.e., X1-X15). The y-axis indicates the value of NMI between each
latent variable (i.e., Y1-Y3) and each attribute. There are three curves, labeled
Y1-Y3. Those are called the feature curves of the true partitions.
        </p>
        <p>The LTM obtained by BI from the synthetic data has the same structure
as the generative model. In particular, it has three latent variables. Each latent
variable represents a soft partition of the data. We obtained a hard partition by
assigning each data case to the state with the highest posterior probability. We
call those partitions the BI partitions. The curves labeled B1-B3 in Figure 3 are
the feature curves of the BI partitions. They match those of the true partitions
almost perfectly. Based on the feature curves, we matched up the BI partitions
with the true partitions, and computed the NMI for each pair. The NMI values
are 0.91(B1,Y1), 0.86(B2,Y2) and 0.98(B3,Y3) respectively. Those indicate that
BI has recovered the true partitions well.</p>
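        <p>The NMI used throughout the experiments can be computed from two hard label vectors. A minimal sketch; we assume the common normalization I(A;B) / √(H(A)H(B)) (see [16] for the measure itself):</p>
```python
from collections import Counter
from math import log, sqrt

def nmi(a, b):
    """Normalized mutual information between two hard partitions,
    normalized by the geometric mean of the two entropies."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda c: -sum((v / n) * log(v / n) for v in c.values())
    mi = sum((v / n) * log((v / n) / ((ca[x] / n) * (cb[y] / n)))
             for (x, y), v in cab.items())
    return mi / sqrt(h(ca) * h(cb))
```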
      </sec>
      <sec id="sec-4-2">
        <title>Clustering Real-World Data</title>
        <p>The real-world data set is known as WebKB data. It consists of web pages
collected in 1997 from the computer science departments of 4 universities: Cornell,
Texas, Washington and Wisconsin. The web pages were originally divided into
7 classes. Only 4 classes remain after preprocessing, namely student, faculty,
project and course. There are 1041 pages in total and the number of words was
reduced to 336. The words are used as attributes in our analysis. The attributes
are binary and indicate the presence or absence of the words. Both the university
labels and class labels were removed before the analysis.</p>
        <p>On the WebKB data set, BI produced an LTM, henceforth called
WebKB-LTM, with 98 latent variables. This means that BI has partitioned the data in
98 different ways. A big question is: Are those partitions meaningful or do they
appear to be arbitrary? It turns out that many of the partitions are meaningful.
We present some of them in the following.</p>
        <p>To start with, we give an overview of the structure of WebKB-LTM. It
divides itself into three parts. The model of the first part is shown in Figure 1.
It involves words that usually appear in general descriptions of computer
science departments. So we call it the department subnet. The second part contains
words such as research, interest, papers, ieee, etc. We call it the research subnet.
The third part involves words related to course teaching. So we call it the teach
subnet.</p>
        <p>The Department Subnet: We first examine two latent variables, Y51 and Y54,
in Figure 1. To inspect the meanings of the partitions, we can draw the information
curves shown in Figure 4. Take the information curves of Y51 as an example: there
are two curves in the figure. The lower one shows the mutual information (MI)
between Y51 and each attribute. The attributes are sorted according to MI, and
only the top 5 attributes are shown here. The upper curve shows the cumulative
mutual information (CMI) between Y51 and each attribute plus all the attributes
before it. The numbers on the left vertical axis are MI values. The numbers on
the right axis are the ratios of the MI values over the MI between the partition
and all the attributes. The ratio is called information coverage (IC). The IC of
the first 5 attributes is around 98%. Intuitively, this means that the differences
among the different clusters on the first 5 attributes account for 98% of the
total differences. So we can say that the partition is primarily based on these
attributes.</p>
        <p>The table below shows the occurrence frequencies of the words in the clusters.
We see that Y51 represents a partition based on wisc, madison, wi, dayton and
wisconsin. The clusters Y51=2 and Y51=3 together correspond to U
WisconsinMadison. Y54 represents another partition based on austin, texas, utexas and ut.
The cluster Y54=2 corresponds to U Texas-Austin.
[Table garbled in extraction: for each cluster of Y47, Y48, Y51 and Y54, it gave the occurrence frequencies of the most informative words and the cluster sizes, together with the information coverage (IC) of each partition.]</p>
        <p>We can do a similar analysis for Y47 and Y48. As shown in the tables above, the
most informative attributes are selected based on information curves, which are
omitted to save space. Information coverages (IC) are given to show how well the
attributes collectively convey the meanings of the latent variables. It is clear that
Y47 and Y48 are also meaningful. Y47=2 seems to identify web pages from Cornell,
and Y48=2 and Y48=3 together seem to identify web pages from U
Washington-Seattle.</p>
        <p>The Research Subnet: The following tables contain information about three
latent variables from the faculty subnet.</p>
        <p>Y10: IC=92%
cluster 1 2
image .02 .5
images .02 .38
visual .01 .31
pattern 0 .19
vision .02 .3
Size .91 .09</p>
        <p>Y11: IC=90%
cluster 1 2
management .02 .57
database .04 .58
storage 0 .29
databases .01 .33
query 0 .3
Size .88 .12</p>
        <p>Y14: IC=95%
cluster 1
april .02
march .01
february 0
conference .06
symposium .03
proceedings .04
january .02
international .05
Size .86</p>
        <p>Latent variable Y10 partitions the web pages based on words such as image,
pattern and vision. Y10=2 seems to identify web pages belonging to faculty
members who work in the vision/image analysis area. Y11 partitions the web pages
based on words such as database, storage and query. Y11=2 seems to identify
web pages belonging to faculty members who work in the database area. Other
latent variables in the faculty subnet give us clusters of web pages for other
research areas such as artificial intelligence, networking, etc. Besides research
area-related partitions, the faculty subnet also gives us some other interesting
clusterings. One example is Y14. Y14=2 seems to identify web pages containing
papers published at conferences held in January-April.</p>
        <p>The Teach Subnet: The teach subnet gives us partitions based on various
aspects of course information. One example is Y66. It is clear from the
information given below that Y66=4 identifies web pages on object-oriented programming
(OOP), while Y66=2 identifies web pages on programming but not mentioning
OOP. Those might be web pages of OOP courses and of introductory
programming courses respectively. Y66=3 seems to correspond to web pages of other
courses that involve programming, while Y66=1 seems to mean web pages not
on programming.</p>
        <p>In summary, the BI algorithm has identified a variety of interesting clusters
in the WebKB data. The situation is similar on several other data sets that we
tried. It should be noted that not all the results obtained by BI are meaningful
and interesting. After all, it is an exploratory data analysis method.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Comparisons with Alternative Methods</title>
      <p>
        In this section we compare BI with four other MPC algorithms: orthogonal
projection (OP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], singular alternative clustering (SAC) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], DK [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and EAST.
Those are the algorithms that we were able to obtain implementations for or
implement ourselves.
      </p>
      <p>The motivation for MPC is the observation that complex data can be
meaningfully clustered in multiple ways. As such, we need to consider two questions
when comparing different MPC methods: (1) Do they find meaningful
clusterings? (2) How many meaningful clusterings do they find?</p>
      <sec id="sec-5-1">
        <title>Meaningful Clustering</title>
        <p>A common way to evaluate a single-partition clustering algorithm is to start with
labeled data, remove the class labels, perform cluster analysis, and compare the
partition obtained with the partition induced by the class labels. We refer to
those two partitions as the cluster partition and the class partition respectively.
The quality of the cluster partition is often measured using its NMI with the
class partition.</p>
        <p>In the context of MPC, we have multiple class partitions and multiple cluster
partitions. How should the evaluation be carried out? Following the literature,
we match each class partition up with the cluster partition with which it has the
highest NMI and report the NMI values of the matched pairs. This means that
if one of the cluster partitions closely resembles a class partition, then we claim
that the class partition is recovered from the data. Similar partitions have similar
feature curves. So, we can sometimes do the matching based on feature curves,
as was done in Section 4.1. A nice thing about the use of feature curves is that
they tell us what the partitions are about.</p>
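        <p>The matching scheme above can be sketched as a per-row argmax over a precomputed NMI matrix (rows: class partitions; columns: cluster partitions; names are ours):</p>
```python
def match_partitions(nmi_matrix):
    """For each class partition (a row of NMI values against every cluster
    partition), pick the cluster partition with the highest NMI and report
    the matched index and its NMI value."""
    return [(max(range(len(row)), key=row.__getitem__), max(row))
            for row in nmi_matrix]
```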
        <p>Results on Synthetic Data: We tested OP, SAC, DK and EAST on the
same synthetic data that was used to test BI in Section 4.1. The published
version of DK can only produce two partitions. We ran DK on the synthetic data
to obtain two, instead of three, binary partitions. OP and SAC are sequential
MPC methods. Following the authors, we first ran K-means to find a partition
of two clusters, and then continued to run OP and SAC to find two more binary
partitions.</p>
        <p>We ran the experiments 10 times for each algorithm. Here are the
NMI values between each true class partition and the matched cluster partition
obtained by each algorithm:</p>
        <p>DK SAC OP EAST BI
Y1 .78±.00 .73±.04 .73±.02 .91±.00 .91±.00
Y2 .48±.00 .57±.04 .55±.04 .86±.00 .86±.00
Y3 .97±.00 .59±.49 .91±.20 .98±.00 .98±.00</p>
        <p>The NMI values are high for EAST and BI, indicating that those two
algorithms have recovered the three true class partitions well. On the other hand,
the NMI values for the other algorithms are relatively lower, indicating that they
have not been able to recover the true class partitions as well as EAST and BI.</p>
        <p>The results by EAST and BI are identical in terms of NMI values. However,
BI took much less time than EAST. The running time of BI was 22±3 seconds,
while that of EAST was 485±34 seconds.</p>
        <p>Results on Real-World Data: DK, OP and SAC were also tested on the
WebKB data. They were told to find two partitions, each with four clusters,
because it is known that there are two true class partitions, each with four classes.</p>
        <p>One of the class partitions divides the web pages into four classes according to
the four universities. It is clear from Section 4.2 that BI has recovered the four
classes. However, they were given in the form of four partitions (Y47, Y48, Y51 and
Y54), instead of one. For comparability, we transformed the 4-class partition
into four logically equivalent binary class partitions. Each binary class partition
divides the web pages according to whether they are from a particular university.
The same transformation was applied to the other class partition and the cluster
partitions obtained by the alternative algorithms. After the transformations, we
matched up the binary class partitions with the cluster partitions and computed
the NMI of each matched pair. The results are shown in the following table. The
average was taken over 10 runs of the algorithms.</p>
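        <p>The one-vs-rest transformation described above is mechanical; a minimal sketch (names are ours):</p>
```python
def to_binary_partitions(labels):
    """Transform a k-class partition into k logically equivalent binary
    (one-vs-rest) class partitions, one per class label."""
    return {c: [int(l == c) for l in labels] for c in sorted(set(labels))}
```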
        <p>We see that the NMI is the highest for BI in almost all cases. This means that
BI has recovered most of the 8 classes better than the alternative algorithms.
(Footnote 4: as seen in Section 4.2, some of the partitions obtained by BI contain
multiple clusters that correspond to a true class. For example, both Y48 = 2 and
Y48 = 3 correspond to U Washington-Seattle. If such clusters were manually
aggregated, the NMI values for BI would look better.)</p>
        <p>DK SAC OP BI
course .43±.01 .47±.01 .47±.02 .59±.01
faculty .18±.04 .17±.07 .18±.01 .31±.01
project .04±.00 .04±.00 .05±.04 .07±.00
student .18±.00 .20±.00 .20±.01 .20±.02
cornell .22±.15 .09±.02 .36±.24 .48±.11
texas .31±.18 .20±.20 .45±.23 .60±.02
washington .22±.13 .41±.23 .56±.25 .52±.12
wisconsin .38±.12 .16±.12 .45±.13 .48±.10</p>
        <p>We also tested EAST on the WebKB data. However, it did not finish in two
weeks. In contrast, BI took only 90 minutes.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Richness of Clusterings</title>
        <p>
          The WebKB data was analyzed by a number of authors using different methods
[
          <xref ref-type="bibr" rid="ref1 ref5 ref7">1, 5, 7</xref>
          ]. In all cases, only two partitions and a few clusters were obtained. In
contrast, BI has discovered, as shown in Section 4.2, a rich collection of meaningful
clusters. This is what sets BI far apart from other methods.
        </p>
        <p>Why is it difficult for other methods to produce rich clustering results on
complex data sets? The sequential MPC methods and the distance-based
simultaneous MPC methods cannot determine the numbers of partitions and
clusters; the user has to provide such information as input. In single-partition
clustering, not being able to determine the number of clusters is not a big problem:
one can manually try a few possibilities and pick the one that yields the best results.
In the case of MPC, however, not being able to determine the numbers of partitions
and clusters is a severe drawback, because there are too many possibilities to consider.
As a simplification of the problem, suppose it is known that there are 10 partitions.
Determining the numbers of clusters for the 10 partitions manually would still be very
difficult, as there are too many possible combinations to try.</p>
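        <p>The scale of the difficulty is easy to quantify. Keeping the 10 partitions of the example above, and assuming (our own illustrative choice) that each partition may have anywhere from 2 to 5 clusters, the number of cluster-count combinations a user would have to try grows exponentially:</p>
        <preformat>
```python
# Hypothetical setting: 10 partitions, each allowed 2-5 clusters.
n_partitions = 10
cluster_count_choices = len(range(2, 6))  # 4 candidate cluster counts per partition

# Each partition's cluster count is chosen independently,
# so the number of combinations multiplies across partitions.
n_combinations = cluster_count_choices ** n_partitions
print(n_combinations)  # 4**10 = 1048576 combinations
```
        </preformat>
        <p>Over a million configurations for even this toy setting, which is why a method that determines the numbers of partitions and clusters automatically matters so much for MPC.</p>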
        <p>
          Other model-based simultaneous MPC methods can determine the numbers
of partitions and clusters. However, they are still unable to produce rich
clustering results on complex data sets. The algorithm by Galimberti and Soffritti
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is inefficient and has so far been tested only on data sets with fewer than 10
attributes. The EAST algorithm is not efficient enough to handle data sets with
100 or more attributes. The method by Guan et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is efficient enough to deal
with the WebKB data, but it produces only two partitions.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>A Remark</title>
        <p>The evaluation method used in Section 5.1 has one drawback: a naïve method
that generates a huge number of random partitions might receive a good evaluation.
So, it is important to test MPC methods on real-world data sets and see whether
they can help find meaningful clusters. On the WebKB data, BI found 98
partitions. By inspecting their information curves, we were able to quickly identify
dozens of meaningful partitions. With the random method, however, one would
have to examine a huge number of partitions before finding a meaningful one.
Hence the random method is of little practical use.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper is concerned with the use of latent tree models (LTMs) for multiple
partition clustering. We propose a new algorithm, called BI, for learning LTMs
that is much faster than the best previous algorithm and can handle data with
hundreds of attributes. The key idea behind the algorithm is the UD-test, which
determines whether the interactions among a set of observed variables can be
modeled using a single latent variable. Empirical results show that BI is able
to produce much richer clustering results than alternative methods.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The research reported in this paper was supported by Hong Kong Research Grants
Council GRF Grant #622408, the National Basic Research Program of China (aka the
973 Program) under project No. 2011CB505101, and HKUST Fok Ying Tung
Graduate School.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fern</surname>
            ,
            <given-names>X.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dy</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          :
          <article-title>Non-redundant multi-view clustering via orthogonalization</article-title>
          .
          <source>In: ICDM-07</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A principled and flexible framework for finding alternative clusterings</article-title>
          .
          <source>In: KDD-09</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gondek</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Non-redundant data clustering</article-title>
          .
          <source>KAIS-07</source>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          :
          <article-title>Simultaneous unsupervised learning of disparate clusterings</article-title>
          .
          <source>In: SDM-08</source>
          . (
          <year>2008</year>
          )
          <fpage>858</fpage>
          -
          <lpage>869</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dy</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Multiple non-redundant spectral clustering views</article-title>
          .
          <source>In: ICML-10</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Galimberti</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soffritti</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Model-based methods to identify multiple cluster structures in a data set</article-title>
          .
          <source>CSDA-07</source>
          <volume>52</volume>
          (
          <year>2007</year>
          )
          <fpage>520</fpage>
          -
          <lpage>536</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dy</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Variational inference for nonparametric multiple clustering</article-title>
          .
          <source>In: MultiClust Workshop, KDD-2010</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N.L.</given-names>
          </string-name>
          :
          <article-title>Hierarchical latent class models for cluster analysis</article-title>
          .
          <source>JMLR-04 5</source>
          (
          <year>2004</year>
          )
          <fpage>697</fpage>
          -
          <lpage>723</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>L.K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Variable selection in model-based clustering: To do or to facilitate</article-title>
          .
          <source>In: ICML-10</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Estimating the dimension of a model</article-title>
          .
          <source>The Annals of Statistics</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>1978</year>
          )
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Efficient model evaluation in the search-based approach to latent structure discovery</article-title>
          .
          <source>In: PGM-08</source>
          ,
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Harmeling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          :
          <article-title>Greedy learning of binary latent trees</article-title>
          .
          <source>TPAMI-10</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>V.Y.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anandkumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willsky</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          :
          <article-title>Learning latent tree graphical models</article-title>
          .
          <source>Computing Research Repository</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Probabilistic Graphical Models: Principles and Techniques</article-title>
          . MIT Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Chow</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.N.</given-names>
          </string-name>
          :
          <article-title>Approximating discrete probability distributions with dependence trees</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Strehl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Cluster ensembles - a knowledge reuse framework for combining multiple partitions</article-title>
          .
          <source>JMLR 3</source>
          (
          <year>2002</year>
          )
          <fpage>583</fpage>
          -
          <lpage>617</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>