<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fast Multidimensional Clustering of Categorical Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tengfei</forename><surname>Liu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">The Hong Kong University of Science and Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nevin</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
							<email>lzhang@cse.ust.hk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">The Hong Kong University of Science and Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kin</forename><forename type="middle">Man</forename><surname>Poon</surname></persName>
							<email>lkmpoon@cse.ust.hk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">The Hong Kong University of Science and Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yi</forename><surname>Wang</surname></persName>
							<email>wangy@comp.nus.edu.sg</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">National University of Singapore</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hua</forename><surname>Liu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">The Hong Kong University of Science and Technology</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Fast Multidimensional Clustering of Categorical Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">16122E6B9CF4580E5CCE306BCB041723</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:42+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multidimensional Clustering</term>
					<term>Model-based Clustering</term>
					<term>Latent Tree Models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Early research work on clustering usually assumed that there was one true clustering of data. However, complex data are typically multifaceted and can be meaningfully clustered in many different ways. There is a growing interest in methods that produce multiple partitions of data. One such method is based on latent tree models (LTMs). This method has a number of advantages over alternative methods, but is computationally inefficient. We propose a fast algorithm for learning LTMs and show that the algorithm can produce rich and meaningful clustering results in moderately large data sets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>There are several clustering methods that produce multiple partitions. We refer to them as multi-partition clustering (MPC) methods. MPC methods, according to the way that partitions are found, can be divided into two categories: sequential MPC methods and simultaneous MPC methods.</p><p>Sequential MPC methods produce multiple clustering solutions sequentially. One family of such methods is known as alternative clustering <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref>. It aims to discover a new clustering that is different from a previously known clustering. The key issue is how to ensure the novelty of the new clustering. One can repeatedly apply such methods to produce a sequence of clusterings.</p><p>Simultaneous MPC methods, on the other hand, produce multiple clustering solutions simultaneously. Both distance-based and model-based methods have been proposed for simultaneous MPC. The distance-based methods proposed by <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> require as inputs the number of partitions and the number of clusters in each partition. They try to optimize the quality of each individual partition while keeping different partitions as dissimilar as possible. Model-based methods fit data with a probabilistic model that contains multiple latent variables. Each latent variable represents a soft partition and can be viewed as a hidden dimension of data. We therefore also refer to model-based simultaneous MPC as multidimensional clustering (MDC). Unlike distance-based methods, model-based methods can automatically determine the numbers of partitions and the number of clusters in each partition based on statistical principles.</p><p>Among the MDC methods, Galimberti et al. <ref type="bibr" target="#b5">[6]</ref> and Guan et al. <ref type="bibr" target="#b6">[7]</ref> use what we call disjoint and independent view (DIV) models. In a DIV model, each latent variable is associated with a subset of attributes, which gives one view of data. The subsets for different latent variables are disjoint. A latent variable is independent of all the other latent variables and all the attributes that are not associated with it. On the other hand, Zhang <ref type="bibr" target="#b7">[8]</ref> and Poon et al. <ref type="bibr" target="#b8">[9]</ref> use latent tree models (LTMs). An LTM can be viewed as a DIV model with the latent variables connected to form a tree structure.</p><p>This paper is concerned with the use of LTMs for producing multiple clustering solutions. Currently, there is a lack of efficient algorithms for learning LTMs. The fastest algorithm takes weeks to process data sets with around 100 attributes and a few thousand samples. We propose a more efficient algorithm that can analyze data with hundreds of attributes in a few hours. We also provide empirical results to show that the algorithm can produce much richer clustering results than alternative methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Latent Tree Models</head><p>Technically, an LTM is a Markov random field over an undirected graph, where variables at leaf nodes are observed and variables at internal nodes are hidden. An example of an LTM is shown in Figure <ref type="figure" target="#fig_1">1</ref>. It is part of the latent tree model learned from the WebKB data, which will be described in more detail in Section 4.2. The Y-variables are the latent variables. The numbers in parentheses are the numbers of states of the latent variables. The leaf nodes here are different words, each taking binary values to indicate the presence or absence of the word. The edges represent probabilistic dependence, and their widths represent dependence strength.</p><p>For technical convenience, we often root an LTM at one of its latent nodes and regard it as a directed graphical model, i.e., a Bayesian network. For the model in Figure <ref type="figure" target="#fig_1">1</ref>, suppose we use Y 43 as the root. Then the edges are directed as pointing away from Y 43 . For example, the edges Y 43 − Y 53 and Y 53 − ut should be directed as Y 43 → Y 53 and Y 53 → ut. The numerical information of the model includes a marginal distribution P(Y 43 ) for the root and one conditional distribution for each edge. For the edge Y 43 → Y 53 , we have the distribution P(Y 53 |Y 43 ); for the edge Y 53 → ut, we have the distribution P(ut|Y 53 ); and so on. The product of those distributions defines a joint distribution over all the latent and observed variables. In this paper, we assume all variables are discrete.</p></div>
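<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the rooted factorization concrete, here is a minimal Python sketch that evaluates the joint probability of one configuration of a small LTM as the product of the root marginal and one conditional distribution per edge. The structure, variable names and numbers are invented for illustration and are not taken from the model in Figure 1.</p><p>
# A minimal sketch of how a rooted LTM defines a joint distribution.
# The tree, names and numbers below are illustrative, not from the paper.

root_marginal = {0: 0.6, 1: 0.4}                      # P(Y1)
cond = {
    ("Y2", "Y1"): {(0, 0): 0.9, (1, 0): 0.1,          # P(Y2 | Y1)
                   (0, 1): 0.2, (1, 1): 0.8},
    ("X1", "Y1"): {(0, 0): 0.7, (1, 0): 0.3,          # P(X1 | Y1)
                   (0, 1): 0.1, (1, 1): 0.9},
    ("X2", "Y2"): {(0, 0): 0.8, (1, 0): 0.2,          # P(X2 | Y2)
                   (0, 1): 0.3, (1, 1): 0.7},
    ("X3", "Y2"): {(0, 0): 0.6, (1, 0): 0.4,          # P(X3 | Y2)
                   (0, 1): 0.05, (1, 1): 0.95},
}

def joint_probability(assignment):
    """P(Y1, Y2, X1, X2, X3) = P(Y1) times the product over edges of P(child | parent)."""
    p = root_marginal[assignment["Y1"]]
    for (child, parent), table in cond.items():
        p *= table[(assignment[child], assignment[parent])]
    return p

print(joint_probability({"Y1": 0, "Y2": 1, "X1": 0, "X2": 1, "X3": 1}))
</p></div>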
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The State of the Art</head><p>Suppose there is a data set D. The attributes of the data set are the observed variables. To learn an LTM from D, one needs to determine: (1) the number of latent variables, (2) the number of states of each latent variable, (3) the connections among the latent and observed variables, and (4) the probability parameters. We use m to denote the information for the first three items and θ to denote the collection of parameter values. We aim at finding the pair (m, θ*) where θ* is the maximum likelihood estimate of the parameters and m maximizes the BIC score <ref type="bibr" target="#b9">[10]</ref>:</p><formula xml:id="formula_0">BIC(m | D) = log P(D | m, θ*) − (d(m)/2) log N</formula><p>where d(m) is the number of free parameters in m and N is the sample size. Several algorithms for learning LTMs have been proposed. The state-of-the-art is an algorithm called EAST <ref type="bibr" target="#b10">[11]</ref>. It is a search-based method and is capable of producing good models for data sets with dozens of attributes. However, it is not efficient enough for data sets with more than 100 attributes. There are two other algorithms that are more efficient than EAST, but they focus on special LTMs. Harmeling and Williams <ref type="bibr" target="#b11">[12]</ref> consider only binary trees, while Choi et al. <ref type="bibr" target="#b12">[13]</ref> assume all the variables share the same domain. Neither method is intended for cluster analysis.</p></div>
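<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the scoring criterion, the following Python sketch computes the BIC score from a model's maximized log-likelihood, its number of free parameters and the sample size. The log-likelihoods and parameter counts in the example are made up; the sketch is not part of any LTM learning package.</p><p>
import math

def bic_score(loglik, num_free_params, sample_size):
    """BIC(m | D) = log P(D | m, theta*) - (d(m)/2) * log N."""
    return loglik - 0.5 * num_free_params * math.log(sample_size)

# Illustrative comparison of two candidate structures on the same data
# (the log-likelihoods and parameter counts are invented for this example).
score_small = bic_score(loglik=-12450.0, num_free_params=35, sample_size=1000)
score_large = bic_score(loglik=-12320.0, num_free_params=90, sample_size=1000)
best = "larger model" if score_large > score_small else "smaller model"
print(score_small, score_large, best)
</p></div>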
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The Bridged Islands Algorithm</head><p>We now set out to present a new algorithm that is drastically more efficient than EAST. In an LTM, the set of attributes that are connected to the same latent variable is called a sibling cluster. Attributes in the cluster are said to be siblings. In the LTM shown in Figure <ref type="figure" target="#fig_1">1</ref>, attributes austin, utexas and ut form one sibling cluster because they are all connected to Y 53 .</p><p>The new algorithm proceeds in five steps:</p><p>1. Partition the set of attributes into sibling clusters; 2. For each sibling cluster introduce a latent variable and determine the number of states of this variable; 3. Determine the connections among the latent variables so that they form a tree; 4. Determine the values of the probability parameters; 5. Refine the model.</p><p>If we imagine the sibling clusters formed in Step 1, together with the latent variables added in Step 2, as islands in an ocean, then the islands are connected in Step 3. So we call the algorithm the bridged islands (BI) algorithm. In the following, we describe each step of BI in detail.</p><p>Step 1: BI determines sibling clusters using two intuitions. First, attributes from the same sibling cluster tend to be more closely correlated than those from different sibling clusters. Second, if two attributes are siblings in the optimal model for one set of attributes, they should also be siblings in the optimal model for a subset of the attributes.</p><p>BI determines the first sibling cluster as follows. There is a working subset of attributes. Initially, it contains the pair of attributes with the highest mutual information (MI). Here MI is computed from the empirical distribution of the data. BI grows the working subset by adding other attributes into it one by one. At each step, it chooses the attribute that has the highest MI with the current subset. (The first intuition is used here.) The MI between a variable X and a set S is estimated as follows:</p><formula xml:id="formula_1">I(X; S) = max_{Z ∈ S} I(X; Z) = max_{Z ∈ S} Σ_{X,Z} P(X, Z) log [P(X, Z) / (P(X) P(Z))]<label>(1)</label></formula><p>BI determines when to stop expanding the working subset using the unidimensionality test, or simply the UD-test. This is the most important idea of this paper. A subset of attributes is unidimensional if the optimal LTM (i.e., the one with the highest BIC score) for those attributes contains only one latent node. The UD-test determines whether a subset of attributes is unidimensional by first projecting data onto those attributes and then running EAST on the projected data. If the resulting model contains only one latent variable, then the UD-test concludes that the subset is unidimensional. Otherwise, it concludes the opposite. For computational efficiency, EAST is allowed to examine only models with 1 or 2 latent variables.</p><p>After each attribute is added to the working subset, BI runs the UD-test on the subset. If the test fails, the attributes in the subset cannot all be in one sibling cluster in the final model according to the second intuition. So, BI stops growing the subset and picks, from the local model learned during the UD-test, the sibling cluster that contains (one or both of) the two initial attributes as the first sibling cluster for the whole algorithm. Attributes in the cluster are then removed from the data set and the process repeats to find other sibling clusters.</p><p>Figure <ref type="figure" target="#fig_2">2</ref> illustrates the process. The working subset initially contains X 1 and X 2 . Then X 3 is added. EAST is run to find an LTM for the expanded subset {X 1 , X 2 , X 3 }. The resulting model, shown in (a), contains only one latent variable. So the triplet passes the UD-test. Then X 4 is added and the UD-test is again passed as shown in (b). After that X 5 is added. This time EAST yields a model, shown in (c), with more than one latent variable. So the UD-test fails and BI stops growing the subset. The sibling cluster {X 1 , X 2 , X 4 } of the local model is picked as the first sibling cluster for the whole algorithm.</p>
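<p>The working-subset growth of Step 1 can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation: pairwise_mi is assumed to be a precomputed table of empirical MI values between attribute pairs, and passes_ud_test stands in for running EAST on the data projected onto the subset.</p><p>
# Sketch of Step 1: grow a working subset by the MI criterion of Eq. (1).
# Assumptions: pairwise_mi[a][b] holds the empirical MI between attributes a
# and b, and passes_ud_test(subset) runs EAST on the data projected onto
# `subset` and reports whether the best model found has a single latent node.

def grow_working_subset(attributes, pairwise_mi, passes_ud_test):
    # Seed with the attribute pair of highest mutual information.
    seed = max(((a, b) for a in attributes for b in attributes if a != b),
               key=lambda pair: pairwise_mi[pair[0]][pair[1]])
    subset = list(seed)
    remaining = [a for a in attributes if a not in subset]

    while remaining:
        # Eq. (1): MI between an attribute X and the subset S is the
        # maximum of MI(X; Z) over Z in S.
        best = max(remaining,
                   key=lambda x: max(pairwise_mi[x][z] for z in subset))
        subset.append(best)
        remaining.remove(best)
        if not passes_ud_test(subset):
            # The UD-test failed: stop growing; the caller then picks the
            # sibling cluster containing the seed pair from the local model.
            break
    return subset
</p>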
<p>Step 2: A latent class model (LCM) is an LTM with only one latent node. Figure <ref type="figure" target="#fig_2">2</ref>(a) and (b) show two examples. It is a commonly used finite mixture model for discrete data. At Step 2, BI learns an LCM for each sibling cluster. There are two subtasks: (1) to determine the cardinality of the latent variable and (2) to optimize the probability parameters. BI starts by setting the cardinality of the latent variable to 2 and optimizing the model parameters by running the EM algorithm <ref type="bibr" target="#b13">[14]</ref>. Then it considers repeatedly increasing the cardinality. After each increase, model parameters are re-optimized. The process stops when the BIC score ceases to increase.</p><p>Step 3: After the first 2 steps, BI has obtained a collection of LCMs. We can visualize those LCMs as islands in an ocean. The next step is to link up the islands in a tree formation by adding edges between the latent variables.</p><p>Chow and Liu <ref type="bibr" target="#b14">[15]</ref> give a well-known algorithm for learning tree-structured models among observed variables. It first estimates the MI between each pair of variables from data, then constructs a complete undirected graph with the MI values as edge weights, and finally finds the maximum spanning tree of the graph. The resulting tree model has the maximum likelihood among all tree models. The Chow-Liu algorithm can be adapted to link up the latent variables of the aforementioned LCMs. We only need to specify how the MI between two latent variables is to be estimated. Let m 1 and m 2 be two LCMs with latent variables Y 1 and Y 2 . <ref type="foot" target="#foot_0">3</ref> Enumerate the data cases as d 1 , d 2 , . . . , d N . Let d i1 and d i2 (i ∈ {1, 2, . . . , N}) be respectively the projections of d i onto the attributes of m 1 and m 2 . Define the data-conditional joint distribution of Y 1 and Y 2 as follows:</p><formula xml:id="formula_2">P(Y 1 , Y 2 | D, m 1 , m 2 ) = C Σ_{i=1}^{N} P(Y 1 | m 1 , d i1 ) P(Y 2 | m 2 , d i2 )<label>(2)</label></formula><p>where C is the normalization constant. We estimate the MI between Y 1 and Y 2 from this joint distribution.</p>
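<p>To make Step 3 concrete, here is a hedged Python sketch of the linking procedure. The helper posterior(model, case), which returns the posterior distribution of an LCM's latent variable given one data case, is an assumption rather than code from the paper; the fragment aggregates the per-case posteriors as in Eq. (2) and then links the islands with a maximum spanning tree.</p><p>
import math
from collections import Counter

# Sketch of Step 3 (illustrative, not the authors' code).
# posterior(model, case) is assumed to return {state: P(state | model, case)}
# for the LCM's latent variable on the projection of `case`.

def latent_pair_mi(model1, model2, data, posterior):
    """MI between the latent variables of two LCMs, from Eq. (2)."""
    joint = Counter()
    for case in data:
        for s1, q1 in posterior(model1, case).items():
            for s2, q2 in posterior(model2, case).items():
                joint[(s1, s2)] += q1 * q2
    total = sum(joint.values())                  # C in Eq. (2) is 1 / total
    marg1, marg2 = Counter(), Counter()
    for (s1, s2), w in joint.items():
        marg1[s1] += w / total
        marg2[s2] += w / total
    return sum((w / total) * math.log((w / total) / (marg1[s1] * marg2[s2]))
               for (s1, s2), w in joint.items() if w > 0)

def maximum_spanning_tree(num_models, weight):
    """Prim-style maximum spanning tree over model indices 0..num_models-1;
    weight(i, j) gives the estimated MI between their latent variables."""
    inside, edges = {0}, []
    while len(inside) != num_models:
        w, i, j = max(((weight(i, j), i, j)
                       for i in inside for j in range(num_models)
                       if j not in inside), key=lambda t: t[0])
        edges.append((i, j))
        inside.add(j)
    return edges
</p>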
<p>Step 4: In this step, BI optimizes the probability parameters of the LTM resulting from Step 3 using EM.</p><p>Step 5: The sibling clusters and the cardinalities of the latent variables were determined in Steps 1 and 2. Each of those decisions was made in the context of a small number of attributes. Now that all the variables are connected in a global model, it is time to re-examine the decisions and to see whether adjustments should be made. In Step 5, BI checks each attribute to see whether it should be relocated and each latent variable to see if its cardinality should be changed. All the potential adjustments are evaluated with respect to the model, denoted by m, resulting from the previous step. The beneficial adjustments are executed in one batch after all the evaluations. Adjustment evaluations and adjustment executions are not interleaved because that would require parameter optimization after each adjustment and hence be computationally expensive.</p><p>For each attribute X and each latent variable Y , BI computes their data-conditional MI I(X; Y | D, m) from the following distribution:</p><formula xml:id="formula_4">P(X, Y | D, m) = C Σ_{i=1}^{N} P(X, Y | m, d i )<label>(3)</label></formula><p>where C is the normalization constant. Let Ŷ be the latent variable that has the highest data-conditional MI with X. If Ŷ is not the current parent node of X in m, then it is deemed beneficial to relocate X from its parent node to Ŷ .</p><p>To determine whether a change in the cardinality of a latent variable is beneficial, BI freezes all the parameters that are not affected by the change, runs EM locally to optimize the parameters affected by the change, and recalculates the BIC score. The change is deemed beneficial if the BIC score increases. BI starts from the current cardinality of each latent variable and considers increasing it by one. If it is beneficial to do so, further increases are considered.</p><p>After model refinement, EM is run on the global model one more time to optimize the parameters.</p><p>Computational Efficiency: BI is much faster than EAST. The reason is that most steps of BI involve only a small number of variables and the EM algorithm is run on the global model only twice. We tested BI and EAST on two data sets with 81/108 attributes and 3021/2763 samples respectively. EAST took 6/24 days, while BI took 35/69 minutes. A detailed running time analysis will be given in a longer version of the paper.</p></div>
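<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relocation check of Step 5 can be sketched in the same style. The routine joint_posterior(model, case, x, y), returning P(X, Y | m, d) for one data case, is an assumed helper; the fragment only illustrates how Eq. (3) and the data-conditional MI drive the batch of proposed relocations, and it is not the paper's implementation.</p><p>
import math
from collections import Counter

# Sketch of the Step 5 relocation check (illustrative only).
# joint_posterior(model, case, x, y) is assumed to return a dict
# {(x_state, y_state): P(x_state, y_state | model, case)}.

def data_conditional_mi(model, data, x, y, joint_posterior):
    """MI of attribute x and latent variable y under the distribution of Eq. (3)."""
    table = Counter()
    for case in data:
        for states, p in joint_posterior(model, case, x, y).items():
            table[states] += p
    total = sum(table.values())              # C in Eq. (3) is 1 / total
    px, py = Counter(), Counter()
    for (sx, sy), w in table.items():
        px[sx] += w / total
        py[sy] += w / total
    return sum((w / total) * math.log((w / total) / (px[sx] * py[sy]))
               for (sx, sy), w in table.items() if w > 0)

def proposed_relocations(model, data, attributes, latents, parent_of,
                         joint_posterior):
    """Collect beneficial relocations; they are executed in one batch later."""
    moves = []
    for x in attributes:
        best_latent = max(latents, key=lambda y: data_conditional_mi(
            model, data, x, y, joint_posterior))
        if best_latent != parent_of[x]:
            moves.append((x, parent_of[x], best_latent))
    return moves
</p></div>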
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Empirical Results with BI</head><p>We now present empirical results to demonstrate BI's capability in discovering rich and meaningful multiple clusterings. Comparisons with other methods will be made in the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Clustering Synthetic Data</head><p>As a proof of concept, we first tried BI on synthetic data. The data set was sampled from an LTM with the structure shown in Figure <ref type="figure" target="#fig_4">3</ref> (left). There are 3 latent variables and 15 attributes, and all variables are binary. A total of 1000 data cases were sampled. Each data case contains values for the attributes X 1 -X 15 , but not for the latent variables Y 1 -Y 3 . The data set has 3 'true' partitions, each given by a latent variable. To provide some intuition on what the partitions are about, Figure <ref type="figure" target="#fig_4">3</ref> (right) depicts the normalized mutual information (NMI) <ref type="bibr" target="#b15">[16]</ref> between each partition and each attribute. The x-axis represents the observed variables (i.e., X 1 -X 15 ). The y-axis indicates the value of the NMI between each latent variable (i.e., Y 1 -Y 3 ) and each attribute. There are three curves, labeled Y 1 -Y 3 . Those are called the feature curves of the true partitions.</p><p>The LTM obtained by BI from the synthetic data has the same structure as the generative model. In particular, it has three latent variables. Each latent variable represents a soft partition of the data. We obtained a hard partition by assigning each data case to the state with the highest posterior probability. We call those partitions the BI partitions. The curves labeled B 1 -B 3 in Figure <ref type="figure" target="#fig_4">3</ref> are the feature curves of the BI partitions. They match those of the true partitions almost perfectly. Based on the feature curves, we matched up the BI partitions with the true partitions, and computed the NMI for each pair. The NMI values are 0.91 (B 1 , Y 1 ), 0.86 (B 2 , Y 2 ) and 0.98 (B 3 , Y 3 ) respectively. Those indicate that BI has recovered the true partitions well.</p></div>
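<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who wish to reproduce this kind of comparison, the following Python sketch computes the NMI between two hard partitions given as label lists. It follows the usual definition with the geometric-mean normalization <ref type="bibr" target="#b15">[16]</ref>; it is an illustration, not the evaluation code used for the numbers reported here.</p><p>
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(part_a, part_b):
    """NMI between two hard partitions given as equal-length label sequences."""
    n = len(part_a)
    joint = Counter(zip(part_a, part_b))
    ca, cb = Counter(part_a), Counter(part_b)
    mi = sum((c / n) * math.log((c / n) / ((ca[a] / n) * (cb[b] / n)))
             for (a, b), c in joint.items())
    denom = math.sqrt(entropy(part_a) * entropy(part_b))
    return mi / denom if denom > 0 else 0.0

# Example with made-up labels: two partitions of ten data cases.
print(normalized_mutual_information([0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
                                    [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]))
</p></div>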
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Clustering Real-World Data</head><p>The real-world data set is known as the WebKB data. It consists of web pages collected in 1997 from the computer science departments of 4 universities: Cornell, Texas, Washington and Wisconsin. The web pages were originally divided into 7 classes. Only 4 classes remain after preprocessing, namely student, faculty, project and course. There are 1041 pages in total and the number of words was reduced to 336. The words are used as attributes in our analysis. The attributes are binary and indicate the presence or absence of the words. Both the university labels and the class labels were removed before the analysis.</p><p>On the WebKB data set, BI produced an LTM, henceforth called WebKB-LTM, with 98 latent variables. This means that BI has partitioned the data in 98 different ways. A big question is: Are those partitions meaningful, or do they appear to be arbitrary? It turns out that many of the partitions are meaningful. We present some of them in the following.</p><p>To start with, we give an overview of the structure of WebKB-LTM. It consists of three parts. The model of the first part is shown in Figure <ref type="figure" target="#fig_1">1</ref>. It involves words that usually appear in general descriptions of computer science departments. So we call it the department subnet. The second part contains words such as research, interest, papers, ieee, etc. We call it the research subnet. The third part involves words related to course teaching. So we call it the teach subnet.</p><p>The Department Subnet: We first examine two latent variables Y 51 and Y 54 in Figure <ref type="figure" target="#fig_1">1</ref>. To inspect the meanings of partitions, we can draw the information curves as shown in Figure <ref type="figure" target="#fig_5">4</ref>. Take the information curves of Y 51 as an example: there are two curves in the figure. The lower one shows the mutual information (MI) between Y 51 and each attribute. The attributes are sorted according to MI and only the top 5 attributes are shown here. The upper curve shows the cumulative mutual information (CMI) between Y 51 and each attribute plus all the attributes before it. The numbers on the left vertical axis are MI values. The numbers on the right axis are the ratios of the MI values over the MI between the partition and all the attributes. The ratio is called information coverage (IC). The IC of the first 5 attributes is around 98%. Intuitively, this means that the differences among the different clusters on the first 5 attributes account for 98% of the total differences. So we can say that the partition is primarily based on these attributes.</p><p>The tables below show the occurrence frequencies of the words in the clusters. We see that Y 51 represents a partition based on wisc, madison, wi, dayton and wisconsin. The clusters Y 51 =2 and Y 51 =3 together correspond to U Wisconsin-Madison. Y 54 represents another partition based on austin, texas, utexas and ut. The cluster Y 54 =2 corresponds to U Texas-Austin. We can do a similar analysis for Y 47 and Y 48 . As shown in the tables, the most informative attributes are selected based on information curves, which are omitted to save space. Information coverages (IC) are given to show how well the attributes collectively convey the meanings of the latent variables. It is clear that Y 47 and Y 48 are also meaningful. Y 47 =2 seems to identify web pages from Cornell, and Y 48 =2 and Y 48 =3 together seem to identify web pages from U Washington-Seattle.</p><p>The Research Subnet: The following tables contain information about three latent variables from the research subnet.</p>
<p>Latent variable Y 10 partitions the web pages based on words such as image, pattern and vision. Y 10 =2 seems to identify web pages belonging to faculty members who work in the vision/image analysis area. Y 11 partitions the web pages based on words such as database, storage and query. Y 11 =2 seems to identify web pages belonging to faculty members who work in the database area. Other latent variables in the research subnet give us clusters of web pages for other research areas such as artificial intelligence, networking, etc. Besides research area-related partitions, the research subnet also gives us some other interesting clusterings. One example is Y 14 . Y 14 =2 seems to identify web pages containing papers published at conferences held in January-April.</p><p>The Teach Subnet: The teach subnet gives us partitions based on various aspects of course information. One example is Y 66 . It is clear from the information given below that Y 66 =4 identifies web pages on object-oriented programming (OOP), while Y 66 =2 identifies web pages on programming but not mentioning OOP. Those might be web pages of OOP courses and of introductory programming courses respectively. Y 66 =3 seems to correspond to web pages of other courses that involve programming, while Y 66 =1 seems to mean web pages not on programming.</p><p>In summary, the BI algorithm has identified a variety of interesting clusters in the WebKB data. The situation is similar on several other data sets that we tried. It should be noted that not all the results obtained by BI are meaningful and interesting. After all, it is an exploratory data analysis method.</p></div>
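<div xmlns="http://www.tei-c.org/ns/1.0"><p>The information-curve analysis can be approximated with a short script. The sketch below estimates the MI between a hard partition and each binary attribute, ranks the attributes, and reports a cumulative share of the total. Note one simplification: the cumulative value here is a running sum of individual MI terms rather than the joint CMI used above, so the resulting percentages are only indicative.</p><p>
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Empirical MI between two discrete sequences of equal length."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    ca, cb = Counter(labels_a), Counter(labels_b)
    return sum((c / n) * math.log((c / n) / ((ca[a] / n) * (cb[b] / n)))
               for (a, b), c in joint.items())

def information_curve(partition, attribute_columns, top_k=5):
    """Rank attributes by MI with the partition and report a cumulative share.

    Simplification: the cumulative value is a running sum of individual MI
    terms, not the joint CMI of the paper, so the shares are only indicative."""
    mi_values = {name: mutual_information(partition, column)
                 for name, column in attribute_columns.items()}
    ranked = sorted(mi_values.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(mi_values.values())
    running, curve = 0.0, []
    for name, value in ranked[:top_k]:
        running += value
        curve.append((name, value, running / total if total > 0 else 0.0))
    return curve
</p></div>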
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Comparisons with Alternative Methods</head><p>In this section we compare BI with four other MPC algorithms: orthogonal projection (OP) <ref type="bibr" target="#b0">[1]</ref>, singular alternative clustering (SAC) <ref type="bibr" target="#b1">[2]</ref>, DK <ref type="bibr" target="#b3">[4]</ref> and EAST. Those are the algorithms that we were able to obtain implementations for or implement ourselves.</p><p>The motivation for MPC is the observation that complex data can be meaningfully clustered in multiple ways. As such, we need to consider two questions when comparing different MPC methods: (1) Do they find meaningful clusterings? (2) How many meaningful clusterings do they find?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Meaningful Clustering</head><p>A common way to evaluate a single-partition clustering algorithm is to start with labeled data, remove the class labels, perform cluster analysis, and compare the partition obtained with the partition induced by the class labels. We refer to those two partitions as the cluster partition and the class partition respectively. The quality of the cluster partition is often measured using its NMI with the class partition.</p><p>In the context of MPC, we have multiple class partitions and multiple cluster partitions. How should the evaluation be carried out? Following the literature, we match each class partition up with the cluster partition with which it has the highest NMI and report the NMI values of the matched pairs. This means that if one of the cluster partitions closely resembles a class partition, then we claim that the class partition is recovered from data. Similar partitions have similar feature curves. So, we can sometimes do the matching based on feature curves, as was done in Section 4.1. A nice thing about the use of feature curves is that they tell us what the partitions are about.</p><p>Results on Synthetic Data: We tested OP, SAC, DK and EAST on the same synthetic data that was used to test BI in Section 4.1. The published version of DK can only produce two partitions. We ran DK on the synthetic data to obtain two, instead of three, binary partitions. OP and SAC are sequential MPC methods. Following the authors, we first ran K-means to find a partition of two clusters, and then ran OP and SAC to find two more binary partitions.</p><p>We ran the experiments 10 times for each algorithm. Here are the NMI values between each true class partition and the matched cluster partition obtained by the algorithms: The NMI values are high for EAST and BI, indicating that those two algorithms have recovered the three true class partitions well. On the other hand, the NMI values for the other algorithms are relatively lower, indicating that they have not been able to recover the true class partitions as well as EAST and BI.</p><p>The results by EAST and BI are identical in terms of NMI values. However, BI took much less time than EAST. The running time of BI was 22±3 seconds, while that of EAST was 485±34 seconds.</p><p>Results on Real-World Data: DK, OP and SAC were also tested on the WebKB data. They were told to find two partitions each with four clusters, because it is known that there are two true class partitions each with four classes.</p><p>One of the class partitions divides the web pages into four classes according to the four universities. It is clear from Section 4.2 that BI has recovered the four classes. However, they were given in the form of four partitions (Y 47 , Y 48 , Y 51 and Y 54 ), instead of one. For comparability, we transformed the 4-class class partition into four logically equivalent binary class partitions. Each binary class partition divides the web pages according to whether they are from a particular university. The same transformation was applied to the other class partition and the cluster partitions obtained by the alternative algorithms. After the transformations, we matched up the binary class partitions with the cluster partitions and computed the NMI of each matched pair. The results are shown in the following tables. The average was taken over 10 runs of the algorithms.</p><p>We see that the NMI is the highest for BI in almost all cases. 
This means that BI has recovered most of the 8 classes better than the alternative algorithms. We also tested EAST on the WebKB data. However, it did not finish in two weeks. In contrast, BI took only 90 minutes.</p></div>
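<div xmlns="http://www.tei-c.org/ns/1.0"><p>The matching protocol of this section can be sketched as follows: each class partition is paired with the cluster partition giving the highest NMI, and the NMI of the matched pair is reported. The fragment uses scikit-learn's normalized_mutual_info_score, whose default normalization may differ slightly from <ref type="bibr" target="#b15">[16]</ref>; it is an illustration, not the script behind the reported numbers.</p><p>
# Sketch of the matching step in Section 5.1 (illustrative only).  Partitions
# are given as equal-length lists of hard labels, one label per data case.
from sklearn.metrics import normalized_mutual_info_score

def match_partitions(class_partitions, cluster_partitions):
    """For each class partition, find the cluster partition with the highest NMI."""
    matches = {}
    for class_name, class_labels in class_partitions.items():
        scored = [(normalized_mutual_info_score(class_labels, labels), name)
                  for name, labels in cluster_partitions.items()]
        best_nmi, best_name = max(scored, key=lambda t: t[0])
        matches[class_name] = (best_name, best_nmi)
    return matches
</p></div>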
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Richness of Clusterings</head><p>The WebKB data was analyzed by a number of authors using different methods <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b6">7]</ref>. In all cases, only two partitions and a few clusters were obtained. In contrast, BI has discovered, as shown in Section 4.2, a rich collection of meaningful clusters. This is what sets BI far apart from other methods.</p><p>Why is it difficult for other methods to produce rich clustering results on complex data sets? Well, the sequential MPC methods and the distance-based simultaneous MPC methods cannot determine the numbers of partitions and clusters. The user has to provide such information as input. In single partitioning, not being able to determine the number of clusters is not a big problem. One can manually try a few possibilities and pick the one that yields the best results. In the case of MPC, however, not being able to determine the numbers of partitions and clusters is a severe drawback. There are too many possibilities to consider. As a simplification of the problem, suppose it is known that there are 10 partitions. Determining the number of clusters for the 10 partitions manually would be very difficult as there are too many possible combinations to try.</p><p>Other model-based simultaneous MPC methods can determine the numbers of partitions and clusters. However, they are still unable to produce rich clustering results on complex data sets. The algorithm by Galimberti and Soffritti <ref type="bibr" target="#b5">[6]</ref> is inefficient and has so far been tested only on data sets with fewer than 10 attributes. The EAST algorithm is not efficient enough to handle data sets with 100 or more attributes. The method by Guan et al. <ref type="bibr" target="#b6">[7]</ref> is efficient enough to deal with the WebKB data, but it produces only two partitions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">A Remark</head><p>The evaluation method used in Section 5.1 has one drawback. A naïve method that generates a huge number of random partitions might get a good evaluation. So, it is important to test MPC methods on real-world data sets and see whether they can help find meaningful clusters. On the WebKB data, BI found 98 partitions. By inspecting their information curves, we were able to quickly identify dozens of meaningful partitions. With the random method, however, one would have to examine a huge number of partitions before finding a meaningful one. Hence it is not useful.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>This paper is concerned with the use of latent tree models (LTMs) for multiple partition clustering. We propose a new algorithm, called BI, for learning LTMs that is much faster than the best previous algorithm and can handle data with hundreds of attributes. The key idea behind the algorithm is the UD-test, which determines whether the interactions among a set of observed variables can be modeled using one latent variable. Empirical results are presented to show that BI is able to produce much richer clustering results than alternative methods.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Part of the latent tree model produced by the new algorithm on the WebKB data.</figDesc><graphic coords="3,140.65,106.41,343.02,90.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2.</head><label>2</label><figDesc>Fig. 2. An example of sibling cluster determination.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 3.</head><label>3</label><figDesc>Fig. 3. Generative model and feature curves of partitions.</figDesc><graphic coords="7,308.06,108.26,144.51,70.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Information curves of latent variables Y51 and Y54.</figDesc><graphic coords="8,162.97,105.39,286.31,88.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Occurrence frequencies of the words in the clusters of Y 47 , Y 48 , Y 51 and Y 54 , with information coverage (IC).</figDesc><table><row><cell>Y47: IC=94% cluster 1 2 cornell .04 1 ithaca 0 .47 ny .01 .39 Size .84 .16</cell><cell>Y48: IC=94% 1 2 washington .03 1 cluster seattle .01 .13 1 3 .98 wa 0 0 .92 Size .8 .1 .1</cell><cell>Y51: IC=98% cluster 1 2 wisc 0 .86 .96 3 madison .01 .33 .98 wi 0 .03 .92 dayton 0 .04 .92 wisconsin 0 .31 .94 Size .74 .14 .12</cell><cell>Y54: IC=99% cluster 1 2 austin 0 .86 texas .01 .33 utexas 0 .03 ut 0 .04 Size .82 .18</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Here m1 and m2 include model parameter values.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">As seen in Section 4.2, some of the partitions obtained by BI contain multiple clusters that correspond to a true class. For example, both Y48 = 2 and Y48 = 3 correspond to U Washington-Seattle. If such clusters are manually aggregated, the NMI values for BI would look better.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>Research on this work was supported by Hong Kong Research Grants Council GRF Grant #622408, the National Basic Research Program of China (aka the 973 Program) under project No. 2011CB505101, and the HKUST Fok Ying Tung Graduate School.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Non-redundant multi-view clustering via orthogonalization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">Z</forename><surname>Fern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Dy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDM-07</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A principled and flexible framework for finding alternative clusterings</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Davidson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD-09</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Non-redundant data clustering</title>
		<author>
			<persName><forename type="first">D</forename><surname>Gondek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KAIS</title>
		<imprint>
			<biblScope unit="volume">07</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Simultaneous unsupervised learning of disparate clusterings</title>
		<author>
			<persName><forename type="first">P</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Meka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Dhillon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SDM</title>
		<imprint>
			<biblScope unit="volume">08</biblScope>
			<biblScope unit="page" from="858" to="869" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Multiple non-redundant spectral clustering views</title>
		<author>
			<persName><forename type="first">D</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Dy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML-10</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Model-based methods to identify multiple cluster structures in a data set</title>
		<author>
			<persName><forename type="first">G</forename><surname>Galimberti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Soffritti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CSDA</title>
		<imprint>
			<biblScope unit="volume">07</biblScope>
			<biblScope unit="page" from="520" to="536" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Variational inference for nonparametric multiple clustering</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Guan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Dy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MultiClust Workshop</title>
				<imprint>
			<publisher>KDD</publisher>
			<date type="published" when="2010">2010. 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Hierarchical latent class models for cluster analysis</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JMLR</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="697" to="723" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Variable selection in model-based clustering: To do or to facilitate</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K M</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML-10</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Estimating the dimension of a model</title>
		<author>
			<persName><forename type="first">G</forename><surname>Schwarz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Annals of Statistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="461" to="464" />
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Efficient model evaluation in the search-based approach to latent structure discovery</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PGM</title>
		<imprint>
			<biblScope unit="volume">08</biblScope>
			<biblScope unit="page" from="57" to="64" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Greedy learning of binary latent trees</title>
		<author>
			<persName><forename type="first">S</forename><surname>Harmeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">TPAMI</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Learning latent tree graphical models</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y F</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anandkumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Willsky</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
		<respStmt>
			<orgName>Computing Research Repository</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Probabilistic Graphical Models: Principles and Techniques</title>
		<author>
			<persName><forename type="first">D</forename><surname>Koller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Approximating discrete probability distributions with dependence trees</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Chow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">N</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Information Theory</title>
		<imprint>
			<date type="published" when="1968">1968</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Cluster ensembles -a knowledge reuse framework for combining multiple partitions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Strehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JMLR</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="583" to="617" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
