<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Clustering on HDP Topics to build a Semantic Tree from Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jianfeng Si</string-name>
          <email>jianfsi2@student.cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qing Li</string-name>
          <email>itqli@cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tieyun Qian</string-name>
          <email>qty@whu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaotie Deng</string-name>
          <email>csdeng@cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, City University of Hong Kong</institution>
          ,
          <addr-line>Hong Kong</addr-line>
          ,
          <country>China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, City University of Hong Kong</institution>
          ,
          <addr-line>Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Lab of Software, Eng., Wuhan University</institution>
          ,
          <addr-line>Wuhan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, and topics residing at different node levels of the tree should exhibit different levels of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node directly, which is a quite time-consuming task, our approach is based on a nonparametric Bayesian topic model, namely, Hierarchical Dirichlet Processes (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets of different levels of abstraction are learned from the HDP separately and further integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the hierarchical clustering process, the lower level of specific topics is clustered into a higher level of more general topics in an agglomerative style to obtain the final topic tree. Evaluation of the tree quality on several real-world datasets demonstrates its competitive performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The ever-increasing explosion of online unstructured
information puts forward a strong demand to organize online
resources in a more e cient way. Hierarchical structures are
widely used in knowledge representation, resource
organization or document indexing. For example, the web
directories organize web pages into a hierarchical tree, providing a
comprehensive navigation tool. The discovery of such rich
semantic hierarchies from raw data collections becomes a
fundamental research in data analysis.</p>
      <p>In this paper, we aim to learn a semantic representation
of a text corpus in the form of a topic tree structure. This can
be regarded as a kind of high-level summarization of the
content of any document collection, as a topic tree expresses
a shared conceptualization of interests in a certain domain.
Such a topic tree functions as an outline to help readers
grasp the main idea of the document collection, much like
the table of contents (TOC) of a printed book.</p>
      <p>
        Instead of learning every node directly, which is a quite
time-consuming task, we treat the construction process
of the topic hierarchy mainly as a two-phase task: 1) the
identification or definition of topics; 2) the derivation of
hierarchical relationships between or among the topics. Our
approach is built on a nonparametric Bayesian topic model,
namely, Hierarchical Dirichlet Processes (HDP) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. By
tuning the topics' Dirichlet scale parameter settings, two
topic sets are learned from the HDP separately with
different levels of semantic abstraction: one as the top level,
which represents a small collection of general topics, and
another as the down level, which corresponds to a relatively
larger collection of specific topics. Topics from the two
different sets exhibit different topic granularity in semantic
representation. Based on these, we can efficiently construct
the "middle" level topics directly without modeling them
explicitly. As a result, the hierarchical structure comes out
straightforwardly and the whole learning process speeds up.
      </p>
      <p>Fig.1 shows a sub-tree of our learned topic tree on the
JACM1 dataset, which contains 536 abstracts of the Journal
of the ACM from 1987-2004. There are two super topics
in this sub-tree, one a system-related topic and the other
a database-related topic. When we look into the database
topic, we find that it is further divided into 3 specific aspects,
namely "Scheme Design", "DB Robust" and "Transaction
Control". We also observe that the super topic mainly
contains some widely used function words or stop words,
resulting in the most "general" topic as the root.</p>
      <p>The organization of the paper is as follows. In Section 2
we briefly introduce the related work. We define in Section
3 our problem formulation and propose the HDP-C model.
Our experiments on several real-world datasets are presented
in Section 4, and we conclude our work in Section 5.
1http://www.cs.princeton.edu/~blei/downloads/jacm.tgz
 </p>
      <p>[Figure 1: A sub-tree of the learned topic tree on the JACM dataset. The root topic (function words such as "the", "of", "a") covers a system-related super topic (with specific topics on consensus and fault tolerance, asynchronous protocols, network broadcast, and shared atomic registers) and a database-related super topic (with specific topics on database schemes, data models and transactions, availability and replication, and transaction/concurrency control).]</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Topic modeling, as a task of document modeling, has
attracted much attention in recent years, but much work just
focuses on inferring hidden topics as a flat cluster over the
term space [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the basic and also the most widely
used ones is the Latent Dirichlet Allocation(LDA)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. LDA
can learn a predefined number of topics under a bag-of-topics
assumption. A question that comes with LDA is how many topics
we should ask the model to estimate. To address this
problem, another nonparametric Bayesian model, namely,
Hierarchical Dirichlet Processes (HDP), was introduced and
adopted [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In both LDA and HDP, no relationship is
defined explicitly between topics during learning, and the
estimated topics are "flat".
      </p>
      <p>
        Compared to the "flat" models, hierarchical modeling of
topics can learn more accurate and predictive models,
because hierarchical modeling is more likely to capture the
generative nature of text collections ([
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). The
Correlated Topic Model (CTM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an extension of LDA,
captures the relations between topics, but it only models
pair-wise correlations. The Pachinko Allocation Model
(PAM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] uses a directed acyclic graph (DAG) structure to
learn and represent topic correlations. PAM connects
words and topics in a DAG, where topics reside on the
interior nodes and words reside on the leaf nodes, but PAM
is unable to represent word distributions as parents of other
word distributions. Zavitsanos et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] learned a topic tree
using an LDA for each level, starting with one topic at level
0 and incrementing the number of topics in each further
iteration/level, and used the symmetric KL divergence between
neighboring hierarchies to indicate convergence. The
basis of their work is quite similar to ours, but they learn a
predefined number of topics for each level explicitly. The
Hierarchical Latent Dirichlet Allocation (hLDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the first
model to learn a tree-structured topic distribution. In hLDA,
each document spans a single path from the root
node to a leaf node of a tree with a predefined depth; the
words of that document are then generated via the topics on that
path. This model arranges the topics into a tree, with the
desideratum that more general topics should appear near
the root and more specialized topics should appear near the
leaves [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Hierarchical HDP(hHDP)[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], on the other hand, learns a
topic hierarchy from text by defining an HDP for each level
of the topic hierarchy. The topic hierarchy is learned in a
bottom-up fashion: starting from the document corpus, the
leaf topics are inferred first; then the word distributions of
all leaf topics make up the observations for estimating
the next level up. The procedure repeats until the root
topic is inferred. In hHDP, the parent/child relationships
between upper and lower topics are not clearly identified. Also,
this recursive definition of HDP is likely to suffer from
low time efficiency.
      </p>
      <p>
        Our work is also built on HDP, but only for the
root-level and lowest-level topics. We construct the interior
level topics by a simple clustering algorithm which is quite
efficient, and an evaluation of the final tree quality also
demonstrates its competitive performance. Different from
traditional hierarchical clustering, which gives a
hierarchical partition of documents [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the points in our hierarchical
clustering are word distributions.
      </p>
      <p>
        Evaluation of the learned tree is also a relevant and
interesting topic. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], an ontology evaluation method is
proposed, and we adopt the same evaluation method in our
work due to its close relevance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. HDP-C MODEL</title>
      <p>In this section, we first analyze the impact of the scale
parameter of the Dirichlet distribution, then introduce HDP
briefly, followed by a detailed description of our clustering
algorithm (HDP-C).</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Dirichlet distribution and its scale parameter</title>
      <p>Our HDP-C model is built upon the HDP; the idea
is to tune the topics' Dirichlet scale parameter settings
so as to control the granularity of the topics used to model
the text content. The Dirichlet distribution is a
multi-parameter generalization of the Beta distribution, and
it defines a distribution over distributions, i.e., the samples
from a Dirichlet are distributions on some discrete
probability space. The Dirichlet is in the exponential family and
is a conjugate prior to the parameters of the multinomial
distribution, which facilitates inference and parameter
estimation.</p>
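      <p>As a concrete instance of this conjugacy (a standard fact, stated here for completeness): if n = (n_1, ..., n_k) are the counts of a multinomial sample drawn from θ, the posterior is again Dirichlet, p(θ | n, α) = Dirichlet(α_1 + n_1, ..., α_k + n_k), so inference amounts to adding the observed counts to the prior pseudo-counts α_i.</p>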
      <p>Let θ be a k-dimensional Dirichlet random variable with θ_i ≥ 0 and Σ_{i=1}^{k} θ_i = 1; it lies in the (k-1)-dimensional probability simplex with the following probability density:</p>
      <p>p(θ | α) = ( Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i) ) Π_{i=1}^{k} θ_i^{α_i - 1}   (1)</p>
      <p>where the parameter α is a k-vector with components α_i &gt; 0, and α_i can be interpreted as "prior observation counts" for
events governed by θ_i. Furthermore, α_0 = Σ_i α_i is called
the scale or concentration parameter with the base measure
(α_1/α_0, ..., α_k/α_0), and Γ(x) is the Gamma function.</p>
      <p>A frequently used special case is the symmetric
Dirichlet distribution, where α_1 = ... = α_k = η, indicating that
we have no prior knowledge of which components are more
favorable, and as a result we use a uniform base
measure. The scale parameter η plays an important role in
controlling the variance and sparsity of the samples. For
example, when η = 1, the symmetric Dirichlet distribution is
equivalent to a uniform distribution over the (k-1)-dimensional
probability simplex, i.e., it is uniform over all points in its support.
Values of the scale parameter above 1 favor variates that
are dense, evenly distributed distributions, i.e., all
probabilities returned are similar to each other. Values of the scale
parameter below 1 favor sparse distributions, i.e., most of
the probabilities returned will be close to 0, and the vast
majority of the mass will be concentrated on a few of the
components.</p>
      <p>Fig.2 depicts five samples for each different setting (η =
0.1, η = 1, η = 10) from a 10-dimensional Dirichlet
distribution. Clearly, η = 0.1 leads to samples that bias the
probability mass toward a few components of the sampled
multinomial distribution; η = 1 leads to a uniform distribution
over the simplex; and η = 10 leads to a situation where all samples
are close to each other (in other words, each component gets
similar probability mass).</p>
      <p>In a word, a smaller η setting encourages fewer words to
have high probability mass in each topic; thus, the posterior
requires more topics to explain the data, and as a result we get
relatively more specific topics. Based on this characteristic,
we can obtain two topic sets with different
granularity, corresponding to the up-bound and low-bound
topic sets in the sense of granularity.</p>
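      <p>To make the effect of η concrete, the following sketch (an illustration assuming NumPy is available; the dimensionality and sample count mirror Fig.2 rather than our actual experiments) draws symmetric Dirichlet samples under the three settings and reports how much mass falls on the single largest component.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
k = 10  # dimensionality, as in Fig. 2

for eta in (0.1, 1.0, 10.0):
    # five draws from a symmetric Dirichlet with scale parameter eta
    samples = rng.dirichlet(np.full(k, eta), size=5)
    top_mass = samples.max(axis=1)  # mass on the largest component of each draw
    print("eta =", eta, " mean largest-component mass =", round(top_mass.mean(), 2))

# Small eta yields sparse draws (most mass on a few components);
# large eta yields near-uniform draws (every component gets similar mass).
</preformat>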
    </sec>
    <sec id="sec-5">
      <title>3.2 Hierarchical Dirichlet Processes</title>
      <p>
        HDP is a nonparametric hierarchical Bayesian model which
can automatically decide the number of topics. Fig.3 shows
the graphical model proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. A global random
probability measure G_0 is distributed as a Dirichlet
process (DP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with concentration parameter γ and base
probability measure H. For each document d, a probability
measure G_d is drawn from another Dirichlet process with
concentration parameter α_0 and base probability measure
G_0, where:
      </p>
      <p>G_0 | γ, H ~ DP(γ, H)   (2)</p>
      <p>G_d | α_0, G_0 ~ DP(α_0, G_0)   (3)</p>
      <p>The Chinese restaurant franchise is a good metaphor for
HDP. Assume there is a restaurant franchise holding a
globally shared menu of dishes across all restaurants. At each table
of each restaurant, only one dish is served from the global
menu, selected by the first customer who sits there, and it
is shared among all customers who sit at the same table.
The same dish can be served at multiple tables in multiple
restaurants. In the document modeling scenario, each
document corresponds to a restaurant, each word corresponds
to a customer, and the topics are the dishes of the globally
shared menu. As a result, using HDP, we can finally learn a set of
global topics, and each document covers a subset of
the topics.</p>
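      <p>The following truncated stick-breaking sketch illustrates this two-level generative process (an illustration only, not our inference code; the truncation level, vocabulary size and parameter values are arbitrary assumptions).</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(1)
V, T = 1000, 50              # vocabulary size and truncation level (illustrative)
gamma, alpha0, eta = 1.0, 1.0, 0.125

# Base measure H = Dirichlet(eta): global topics phi_k over the vocabulary.
phi = rng.dirichlet(np.full(V, eta), size=T)

# G_0 ~ DP(gamma, H): global topic weights beta via stick-breaking.
v = rng.beta(1.0, gamma, size=T)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()           # renormalise the truncated sticks

# G_d ~ DP(alpha0, G_0): each document reweights the shared topics.
pi_d = rng.dirichlet(alpha0 * beta + 1e-9)

# Generate a short document: pick a topic per word, then a word from that topic.
topic_of_word = rng.choice(T, size=20, p=pi_d)
words = [rng.choice(V, p=phi[z]) for z in topic_of_word]
</preformat>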
      <p>For our particular application, the base measure H is the
Dirichlet distribution over the term space, i.e., H ~ Dirichlet(η).
So, the scale parameter η is used as the granularity indicator
in our experiment.</p>
      <p>Algorithm 1: Hierarchical clustering algorithm (HDP-C) from low-level topics to the top level</p>
      <preformat>
 1: learn a low-level topic set from HDP with η = 0.125: M = {tl_1, ..., tl_|M|}
 2: learn a top-level topic set from HDP with η = 1.0:   N = {tu_1, ..., tu_|N|}
 3: for each tl_i ∈ M, find its "closest" topic tu_j ∈ N:
 4: for i = 1 to |M| step 1 do
 5:     tu_j = argmin_{tu_j ∈ N} D(tl_i, tu_j)
 6:     tu_j.childList.add(tl_i)
 7:     tu_j.nchild++
 8: end for
 9: cluster the top-level topics' children in an agglomerative hierarchical clustering style:
10: for i = 1 to |N| step 1 do
11:     while tu_i.nchild &gt; 3 do
12:         find the closest children pair (t_x, t_y)
13:         merge (t_x, t_y) into a new inner topic t_m
14:         tu_i.childList.remove(t_x)
15:         tu_i.childList.remove(t_y)
16:         tu_i.childList.add(t_m)
17:         tu_i.nchild--
18:     end while
19: end for
</preformat>
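      <p>A compact Python sketch of Algorithm 1 follows (our own paraphrase, not a reference implementation: topics are word-distribution vectors, the distance function D is supplied by the caller, and the rules for the merged distributions and the root distribution are illustrative assumptions).</p>
      <preformat>
import numpy as np

def make_node(dist, children=()):
    """A tree node holding a word distribution and a list of child nodes."""
    return {"dist": np.asarray(dist, dtype=float), "children": list(children)}

def hdp_clustering(low_topics, top_topics, D, max_children=3):
    """HDP-C sketch: attach each low-level topic to its closest top-level topic,
    then agglomeratively merge each top-level topic's children until at most
    max_children remain, and finally combine the top-level topics into a root."""
    tops = [make_node(t) for t in top_topics]

    # Phase 1: assign every low-level topic to its closest top-level topic.
    for t in low_topics:
        closest = min(tops, key=lambda u: D(np.asarray(t, dtype=float), u["dist"]))
        closest["children"].append(make_node(t))

    # Phase 2: agglomerative merging under each top-level topic.
    for u in tops:
        kids = u["children"]
        while len(kids) &gt; max_children:
            pairs = [(i, j) for i in range(len(kids)) for j in range(i + 1, len(kids))]
            i, j = min(pairs, key=lambda p: D(kids[p[0]]["dist"], kids[p[1]]["dist"]))
            merged = make_node((kids[i]["dist"] + kids[j]["dist"]) / 2.0,  # assumed merge rule
                               children=[kids[i], kids[j]])
            kids = [k for idx, k in enumerate(kids) if idx not in (i, j)] + [merged]
            u["children"] = kids

    # Phase 3: the root combines all top-level topics (assumed combination rule).
    root_dist = np.mean([u["dist"] for u in tops], axis=0)
    return make_node(root_dist, children=tops)
</preformat>
      <p>With D set to the symmetric KL divergence chosen in Section 4, this reproduces the control flow of Algorithm 1; only the averaging of merged and root distributions is our illustrative assumption.</p>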
    </sec>
    <sec id="sec-6">
      <title>3.3 Hierarchical clustering on topics</title>
      <p>Based on the top-level and down-level topic sets, we
use an agglomerative clustering algorithm to build up the
interior nodes.</p>
      <p>The top-level topics give a raw partition of the topic
distribution and can be directly combined to form the root
topic node. They can also help supervise the agglomerative
clustering process for the low-level topics. So, the whole
algorithm is divided into three phases:
1. assign all low-level topics to their immediate top-level
topics;
2. for each subset of low-level topics indexed under a
top-level topic, invoke an agglomerative clustering process;
3. finally, define the root topic node as a combination of the
top-level topics.</p>
      <p>So, the size of the final topic tree is determined by the
number of topics at the top level and the down level, and we decide
the depth of the tree afterward, according to user
requirements, by truncating unwanted lower levels.</p>
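      <p>A minimal sketch of this truncation step (reusing the dictionary node representation from the HDP-C sketch above, which is itself an assumption):</p>
      <preformat>
def truncate(node, depth):
    """Keep only the top `depth` levels of a topic tree built by hdp_clustering;
    deeper nodes are simply dropped, per the user's required tree depth."""
    if depth == 1:
        return {"dist": node["dist"], "children": []}
    return {"dist": node["dist"],
            "children": [truncate(c, depth - 1) for c in node["children"]]}
</preformat>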
      <p>During the clustering process, a pair of "closest" topics is
merged in each iteration. The whole algorithm is presented
as Algorithm 1.</p>
      <p>Algorithm 1 uses a "bottom-up" approach instead of a "top-down"
approach, because we have no clear way to
split a topic distribution into two sub-topics, while merging
two sub-topics into one is much more straightforward.</p>
    </sec>
    <sec id="sec-7">
      <title>4. EXPERIMENT</title>
      <p>
        In this section, we set up our golden line[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] topic trees
from hierarchical data collections, test the tree quality with
different distance measures, and compare the performance
of HDP-C against hLDA.
      </p>
      <p>[Figure 2: Five samples each from a 10-dimensional symmetric Dirichlet distribution under η = 0.1, η = 1 and η = 10; component probabilities range from 0 to 1.]</p>
      <p>To evaluate the learned tree quality, we use the Wikipedia
(WIKI) dataset from the third Pascal Challenge on Large
Scale Hierarchical Text Classification (LSHTC3)2 and the
Open Directory Project (DMOZ) dataset from the second
Pascal Challenge on Large Scale Hierarchical Text Classification
(LSHTC2)3. In total, we obtain three datasets from each of
these two sources. All these datasets contain a hierarchy
file defining the organization of the documents into a
hierarchical tree structure. Each document is assigned to one or
more leaf nodes. The general statistics of these datasets
and the JACM one are shown in Table 1.</p>
      <p>Given the hierarchical relationship, we randomly choose
some sub-trees from it and build their corresponding golden-line
topic trees according to the term frequencies of the
documents assigned to each node, as sketched below.</p>
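      <p>The following sketch reflects our reading of this construction (each chosen hierarchy node becomes a term-frequency distribution aggregated from the documents assigned to it and to its descendants; the data-structure and helper names are hypothetical).</p>
      <preformat>
from collections import Counter

def golden_line_tree(children, doc_terms, doc_assignments):
    """Sketch: build a term-frequency distribution for every node of a chosen
    sub-tree by aggregating the documents assigned to it and to its descendants.

    children:        maps each node to its list of child nodes (the chosen sub-tree)
    doc_terms:       maps each doc id to the list of terms occurring in that document
    doc_assignments: maps each node to the list of doc ids assigned to it
    """
    dists = {}

    def build(node):
        counts = Counter()
        for doc in doc_assignments.get(node, []):
            counts.update(doc_terms[doc])
        for child in children.get(node, []):
            counts.update(build(child))          # fold in the descendants' terms
        total = sum(counts.values()) or 1
        dists[node] = {w: c / total for w, c in counts.items()}
        return counts

    roots = set(children) - {c for kids in children.values() for c in kids}
    for root in roots:
        build(root)
    return dists
</preformat>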
    </sec>
    <sec id="sec-8">
      <title>4.2 Scale effect of η settings on HDP</title>
      <p>The scale parameter η is used as the granularity
indicator in our experiment. Fig.4 shows how the number of topics
learned from HDP on our datasets changes under different
η settings. Fig.5 shows the distribution variances of the
topic collections learned from HDP on our datasets. For each η
setting, the inner variance of the learned topic collection
is measured as the average symmetric KL
divergence between every topic and the centroid distribution
of that collection. As shown, the variances drop almost
consistently as η ranges from 0.1 to 1.0. This observation
is consistent with the interpretation of η discussed above.
2http://lshtc.iit.demokritos.gr/LSHTC3_DATASETS
3http://lshtc.iit.demokritos.gr/LSHTC2_datasets</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Evaluation method</title>
      <p>
        Given the learned topic tree and the golden-line topic tree,
we want to measure how close these two structures are with a
quantitative metric. We use the ontology evaluation method
proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which can capture, quite accurately, the
deviations of the learned structure from the golden line by means of
ontology alignment techniques. In this method, the ontology
concepts are defined as vector space representations, which
is the same as in our setting. We summarize this method as follows:
1. set up a one-to-one matching collection M = {m_1, ..., m_|M|}
based on the dissimilarity measure between nodes of
the learned tree L = {tl_1, ..., tl_|L|} and nodes of the
golden tree G = {tg_1, ..., tg_|G|}, where
|M| = min(|G|, |L|) (a matching sketch follows this list);
2. for each matching m = (tl_i, tg_j), compute the
Probabilistic Cotopy Precision (PCP_m) and Probabilistic
Cotopy Recall (PCR_m);
3. take a weighted average of the PCP and PCR values to compute
P, R and the F-score, where the weight is the similarity
between the nodes of the matching pair.
      </p>
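      <p>A greedy sketch of step 1 is shown below (the exact alignment strategy of [14] may differ; this is our assumption, and the function and argument names are hypothetical).</p>
      <preformat>
def one_to_one_matching(learned, golden, D):
    """Greedy one-to-one matching sketch.  `learned` and `golden` map node ids
    to word distributions, and D is a dissimilarity measure between them."""
    pairs = sorted(
        ((D(ld, gd), li, gi)
         for li, ld in learned.items()
         for gi, gd in golden.items()),
        key=lambda x: x[0],
    )
    used_l, used_g, matching = set(), set(), []
    for dissim, li, gi in pairs:
        if li not in used_l and gi not in used_g:
            matching.append((li, gi, dissim))
            used_l.add(li)
            used_g.add(gi)
    return matching  # |matching| = min(|learned|, |golden|)
</preformat>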
      <p>The corresponding formulas needed for steps 2 and 3 above
are shown below:</p>
      <p>PCP_m = |CS(tl_i) ∩ CS(tg_j)| / |CS(tl_i)|   (4)</p>
      <p>PCR_m = |CS(tl_i) ∩ CS(tg_j)| / |CS(tg_j)|   (5)</p>
      <p>TVD = (1/2) Σ_i |p(i) - q(i)|,   TVD ∈ [0, 1]   (6)</p>
      <p>P = (1/|M|) Σ_{m=1}^{|M|} (1 - TVD_m) PCP_m   (7)</p>
      <p>R = (1/|M|) Σ_{m=1}^{|M|} (1 - TVD_m) PCR_m   (8)</p>
      <p>F = 2PR / (P + R)   (9)</p>
      <p>In the above equations, CS(t) is the cotopy set of node t,
which includes all its direct and indirect super- and sub-topics.</p>
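      <p>A direct transcription of equations (6)-(9) (the per-match PCP_m and PCR_m of equations (4)-(5) are assumed to be computed from the cotopy sets beforehand; the F-score uses the standard harmonic-mean form).</p>
      <preformat>
def tvd(p, q):
    """Total variation distance between two distributions over the same
    vocabulary (Equation 6); p and q are dicts mapping word to probability."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

def weighted_prf(matches):
    """Equations (7)-(9).  `matches` is a list of (pcp_m, pcr_m, tvd_m) triples,
    one per matched node pair."""
    n = len(matches)
    P = sum((1.0 - t) * pcp for pcp, _, t in matches) / n
    R = sum((1.0 - t) * pcr for _, pcr, t in matches) / n
    F = 2 * P * R / (P + R)   # standard harmonic-mean F-score
    return P, R, F
</preformat>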
      <p>
        Gibbs et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reviewed ten of the most popular
probability metrics/distances used by statisticians and
probabilists. We choose four of these and test their influence on
the quality of our learned trees. The selected metrics are the symmetric
KL divergence (D_skl), the Hellinger distance (D_h), and the
symmetric χ² distance (D_sχ²), whose definitions are given below, as
well as the TVD (D_tvd) defined in Equation (6).
      </p>
      <p>D_skl(p, q) = (1/2) [KL(p, q) + KL(q, p)],   KL(p, q) = Σ_{i=1}^{V} p_i log(p_i / q_i)</p>
      <p>D_h(p, q) = ( Σ_{i=1}^{V} (√p_i - √q_i)² )^{1/2}</p>
      <p>D_sχ²(p, q) = (1/2) [D_χ²(p, q) + D_χ²(q, p)],   D_χ²(p, q) = Σ_{i=1}^{V} (p_i - q_i)² / q_i</p>
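      <p>A small sketch of these three distances follows the definitions above (D_tvd appears in the sketch after equations (6)-(9); the distributions are assumed strictly positive, e.g. smoothed topic-word distributions).</p>
      <preformat>
import numpy as np

def sym_kl(p, q):
    """Symmetric KL divergence D_skl; p and q are NumPy arrays over the vocabulary,
    assumed strictly positive."""
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

def hellinger(p, q):
    """Hellinger distance D_h, following the definition above."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def sym_chi2(p, q):
    """Symmetric chi-square distance D_schi2."""
    chi2 = lambda a, b: np.sum((a - b) ** 2 / b)
    return 0.5 * (chi2(p, q) + chi2(q, p))
</preformat>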
      <p>
        Fig.6 plots the F-scores of our learned topic trees on all
the datasets with different distance measures. As observed
from this figure, all choices perform similarly, so we choose
the symmetric KL divergence as the distance measure for
clustering in further experiments. The relationship between
those measures can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>4.5 Performance of HDP-C</title>
      <p>We use hLDA as the baseline to learn a 4-depth topic tree
with the scale parameter settings η = 1.0, η = 0.5, η = 0.25,
and η = 0.125 for the respective levels. This is then compared to the
top 4-depth sub-tree of the tree learned through HDP-C. To
be consistent with hLDA's settings, the top-level topics are
learned with η = 1.0 and the down-level topics are learned with
η = 0.125. We use the default values for the other parameters:
γ = 1.0, α_0 = 1.0, and the maximum number of iterations is 1000. For hLDA,
we set the maximum number of iterations to 2000 because it has a
larger learning space than HDP.
The evaluation result is given in Table 2 (note that the JACM
dataset is not included here due to the lack of a golden-line
topic tree). In terms of the F-score, our approach performs,
on average, 12.1% better on the WIKI datasets and 25.3% better
on the DMOZ datasets. One reason is that, in hLDA, each
document only spans a single path from the root to a leaf node,
which is quite a tough restriction on the mixture of topics
for each document. In contrast, our approach does not place
any prior restriction on each document's topic choice:
each document can span an arbitrary sub-tree, which better
explains the generative nature of the data.</p>
      <p>Besides, we observe from Table 2 that the improvement
in terms of P is much larger than that of R, which indicates that
our approach is preferable for tasks that care more about
precision.</p>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSIONS</title>
      <p>This paper builds a semantic topic tree representation of
a document collection based on a non-parametric Bayesian
topic model. Only the up-bound and low-bound topic sets
are directly inferred, by tuning the topics' Dirichlet scale
parameter for different levels of abstraction. A
hierarchical clustering algorithm (HDP-C) is proposed to derive the
middle-level topics and construct the final topic tree.
Our experimental study on several real-world datasets shows
the competitive performance of our approach.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work described in this paper has been supported by
the NSFC Overseas, Hong Kong &amp; Macao Scholars
Collaborated Researching Fund (61028003) and the Specialized
Research Fund for the Doctoral Program of Higher Education,
China (20090141120050).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Introduction to probabilistic topic models</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          , T. L. Griffiths, and
          <string-name>
            <surname>M. I. Jordan.</surname>
          </string-name>
          <article-title>The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>57</volume>
          :7:
          <issue>1</issue>
          {7:
          <fpage>30</fpage>
          ,
          <year>February 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A correlated topic model of science</article-title>
          .
          <source>AAS</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>17</volume>
          {
          <fpage>35</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          {
          <fpage>1022</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          .
          <article-title>A Bayesian Analysis of Some Nonparametric Problems</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <volume>209</volume>
          {
          <fpage>230</fpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gibbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>On choosing and bounding probability metrics</article-title>
          .
          <source>Internat. Statist. Rev.</source>
          , pages
          <volume>419</volume>
          {
          <fpage>435</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gibbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>On Choosing and Bounding Probability Metrics</article-title>
          .
          <source>International Statistical Review</source>
          ,
          <volume>70</volume>
          :
          <fpage>419</fpage>
          {
          <fpage>435</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <article-title>The cluster-abstraction model: unsupervised learning of topic hierarchies from text data</article-title>
          .
          <source>In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'99</source>
          , pages
          <fpage>682</fpage>
          {
          <fpage>687</fpage>
          , San Francisco, CA, USA,
          <year>1999</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Nonparametric Bayes Pachinko Allocation</article-title>
          .
          <source>In UAI 07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Pachinko allocation: Dag-structured mixture models of topic correlations</article-title>
          .
          <source>In In Proceedings of the 23rd International Conference on Machine Learning</source>
          , pages
          <volume>577</volume>
          {
          <fpage>584</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Mixtures of hierarchical topics with pachinko allocation</article-title>
          .
          <source>In In Proceedings of the 24th International Conference on Machine Learning</source>
          , pages
          <volume>633</volume>
          {
          <fpage>640</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          .
          <article-title>A Survey of Recent Advances in Hierarchical Clustering Algorithms</article-title>
          .
          <source>The Computer Journal</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <volume>354</volume>
          {
          <fpage>359</fpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Hierarchical Dirichlet processes</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>101</volume>
          (
          <issue>476</issue>
          ):
          <volume>1566</volume>
          {
          <fpage>1581</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Gold standard evaluation of ontology learning methods through ontology transformation and alignment. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>23</volume>
          (
          <issue>11</issue>
          ):
          <volume>1635</volume>
          {1648, nov.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Non-parametric estimation of topic hierarchies from texts with hierarchical dirichlet processes</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>999888</volume>
          :
          <fpage>2749</fpage>
          {
          <fpage>2775</fpage>
          ,
          <string-name>
            <surname>Nov</surname>
          </string-name>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Determining automatically the size of learned ontologies</article-title>
          .
          <source>In Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence</source>
          , pages
          <fpage>775</fpage>
          {
          <fpage>776</fpage>
          , Amsterdam, The Netherlands,
          <year>2008</year>
          . IOS Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>