Hierarchical Clustering on HDP Topics to Build a Semantic Tree from Text

Jianfeng Si (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) jianfsi2@student.cityu.edu.hk
Qing Li (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) itqli@cityu.edu.hk
Tieyun Qian (State Key Lab of Software Eng., Wuhan University, Wuhan, China) qty@whu.edu.cn
Xiaotie Deng (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) csdeng@cityu.edu.hk

VLDS'12, August 31, 2012, Istanbul, Turkey. Copyright 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, in which topics residing at different levels of the tree exhibit different degrees of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node of the tree directly, which is quite time consuming, our approach builds on a nonparametric Bayesian topic model, namely the Hierarchical Dirichlet Process (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets with different levels of abstraction are learned from the HDP separately and then integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the hierarchical clustering process, the lower level of specific topics is clustered into the higher level of more general topics in an agglomerative style to obtain the final topic tree. Evaluation of the tree quality on several real-world datasets demonstrates its competitive performance.

1. INTRODUCTION

The ever-increasing explosion of online unstructured information puts forward a strong demand to organize online resources in a more efficient way. Hierarchical structures are widely used in knowledge representation, resource organization and document indexing. For example, web directories organize web pages into a hierarchical tree, providing a comprehensive navigation tool. The discovery of such rich semantic hierarchies from raw data collections has become a fundamental research problem in data analysis.

In this paper, we aim to learn a semantic representation of a text corpus in the form of a topic tree. Such a tree can be regarded as a high-level summarization of the content of a document collection, as it expresses a shared conceptualization of the interests in a certain domain. A topic tree functions as an outline that helps readers grasp the main ideas of the document collection, much like the table of contents (TOC) of a printed book. Instead of learning every node directly, which is a quite time-consuming task, we treat the construction of the topic hierarchy as a two-phase task: 1) the identification or definition of topics; 2) the derivation of hierarchical relationships between the topics. Our approach is built on a nonparametric Bayesian topic model, namely the Hierarchical Dirichlet Process (HDP) [13]. By tuning the topics' Dirichlet scale parameter, two topic sets with different levels of semantic abstraction are learned from the HDP separately: a top level that represents a small collection of general topics, and a down level that corresponds to a relatively larger collection of specific topics. Topics from the two sets thus exhibit different granularity of semantic representation. Based on these two sets, we can construct the "middle" level topics efficiently without modeling them explicitly. As a result, the hierarchical structure comes out straightforwardly and the whole learning process speeds up.

Fig. 1 shows a sub-tree of our learned topic tree on the JACM dataset (http://www.cs.princeton.edu/~blei/downloads/jacm.tgz), which contains 536 abstracts of the Journal of the ACM from 1987 to 2004. There are two super topics in this sub-tree, one related to systems and the other to databases. Looking into the database topic, we find that it is further divided into three specific aspects: "Scheme Design", "DB Robust" and "Transaction Control". We also observe that the super topics mainly contain widely used function words or stop words, resulting in the most "general" topic at the root.

The organization of the paper is as follows. In Section 2 we briefly introduce related work. In Section 3 we define our problem formulation and propose the HDP-C model. Our experiments on several real-world datasets are presented in Section 4, and we conclude in Section 5.

Figure 1: A sub-tree learned from the JACM dataset

2. RELATED WORK

Topic modeling as a task of document modeling has attracted much attention in recent years, but much of this work focuses on inferring hidden topics as a flat clustering over the term space [1]. One of the most basic and widely used models is Latent Dirichlet Allocation (LDA) [4]. LDA learns a predefined number of topics under a bag-of-words assumption. A question that comes with LDA is how many topics the model should estimate. To address this problem, another nonparametric Bayesian model, namely the Hierarchical Dirichlet Process (HDP), was introduced and adopted [13]. In both LDA and HDP, no relationship between topics is defined explicitly during learning, and the estimated topics are "flat".

Compared to such "flat" models, hierarchical modeling of topics can learn more accurate and predictive models, because it is more likely to capture the generative nature of text collections ([11], [9], [10]). The Correlated Topic Model (CTM) [3], an extension of LDA, captures relations between topics, but it only models pair-wise correlations. The Pachinko Allocation Model (PAM) [10] uses a directed acyclic graph (DAG) structure to learn and represent topic correlations. PAM connects words and topics on a DAG, where topics reside on the interior nodes and words reside on the leaf nodes, but PAM is unable to represent word distributions as parents of other word distributions. Zavitsanos et al. [16] learned a topic tree using LDA for each level, starting with one topic at level 0 and incrementing the number of topics in each further iteration/level, and used the symmetric KL divergence between neighboring hierarchies to indicate convergence. The basis of their work is quite similar to ours, but they learn a predefined number of topics for each level explicitly. Hierarchical Latent Dirichlet Allocation (hLDA) [2] is the first model to learn a tree-structured topic distribution. In hLDA, each document spans a single path from the root node to a leaf node of a tree with a predefined depth, and the words of that document are generated via the topics on that path. This model arranges the topics into a tree, with the desideratum that more general topics should appear near the root and more specialized topics near the leaves [8].

Hierarchical HDP (hHDP) [15], on the other hand, learns a topic hierarchy from text by defining an HDP for each level of the hierarchy. The hierarchy is learned in a bottom-up fashion: starting with a document corpus, the leaf topics are inferred first; then the word distributions of all leaf topics make up the observations for the estimation of the next level up. The procedure repeats until the root topic is inferred. In hHDP, the parent/child relationships between upper and lower topics are not clearly identified. Moreover, this recursive definition of HDP is likely to suffer from low time efficiency.

Our work is also built on HDP, but only for the root-level and lowest-level topics. We construct the interior-level topics with a simple clustering algorithm which is quite efficient, and an evaluation of the final tree quality also demonstrates its competitive performance. Different from traditional hierarchical clustering, which gives a hierarchical partition of documents [12], the points in our hierarchical clustering are word distributions.

Evaluation of the learned tree is also a relevant and interesting topic. In [14], an ontology evaluation method is proposed, and we adopt the same evaluation method for our work here due to its close relevance.
3. HDP-C MODEL

In this section, we first analyze the impact of the scale parameter of the Dirichlet distribution, then briefly introduce the HDP, and finally describe our clustering algorithm (HDP-C) in detail.

3.1 Dirichlet distribution and its scale parameter η

Our HDP-C model is built upon the HDP; the idea is to tune the topics' Dirichlet scale parameter so as to control the granularity of the topics used to model the text content. The Dirichlet distribution is a multi-parameter generalization of the Beta distribution, and it defines a distribution over distributions, i.e., samples from a Dirichlet are themselves distributions over some discrete probability space. The Dirichlet is in the exponential family and is the conjugate prior of the parameters of the multinomial distribution, which facilitates inference and parameter estimation.

Let θ be a k-dimensional Dirichlet random variable with θ_i ≥ 0 and \sum_{i=1}^{k} θ_i = 1. It lies in the (k−1)-dimensional probability simplex with the following probability density:

    p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}    (1)

where the parameter α is a k-vector with components α_i > 0, and α_i can be interpreted as the "prior observation count" for the events governed by θ_i. Furthermore, α_0 = \sum_i \alpha_i is called the scale or concentration parameter, with base measure (α_1/α_0, · · · , α_k/α_0), and Γ(x) is the Gamma function.

A frequently used special case is the symmetric Dirichlet distribution, where α_1 = · · · = α_k = η, indicating that we have no prior idea of which components are more favorable; as a result, we use a uniform base measure.

The scale parameter η plays an important role in controlling the variance and sparsity of the samples. For example, when η = 1, the symmetric Dirichlet distribution is equivalent to a uniform distribution over the (k−1)-dimensional probability simplex, i.e., it is uniform over all points in its support. Values of the scale parameter above 1 prefer dense, evenly distributed variates, i.e., all returned probabilities are similar to each other. Values of the scale parameter below 1 prefer sparse distributions, i.e., most of the returned probabilities are close to 0, and the vast majority of the mass is concentrated on a few components.

Fig. 2 depicts five samples for each of three η settings (η = 0.1, η = 1, η = 10) from a 10-dimensional Dirichlet distribution. Clearly, η = 0.1 yields samples that bias their probability mass toward a few components of the sampled multinomial distribution; η = 1 corresponds to a uniform distribution over the simplex; and η = 10 yields samples that are close to each other (in other words, each component gets a similar probability mass).

Figure 2: Samples of Dirichlet distributions with different scale parameter settings (x-axis: the components [1-10]; y-axis: the probability)

In a word, a smaller η encourages fewer words to carry high probability mass in each topic; thus the posterior requires more topics to explain the data, and we obtain relatively more specific topics. Based on this characteristic, we can obtain two topic sets with different granularity, corresponding to the up-bound and low-bound topic sets in the sense of granularity.
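To make the effect of η concrete, here is a minimal Python sketch (numpy only; the seed, the 90% mass threshold and the helper name are our own illustrative choices, not from the paper) that draws samples from a 10-dimensional symmetric Dirichlet for several η values and counts how many components are needed to cover most of the mass:

```python
import numpy as np

def components_for_mass(sample, mass=0.9):
    """Smallest number of components whose probabilities sum to `mass`."""
    sorted_probs = np.sort(sample)[::-1]                      # largest first
    return int(np.searchsorted(np.cumsum(sorted_probs), mass) + 1)

rng = np.random.default_rng(42)
k = 10
for eta in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(np.full(k, eta), size=5)
    counts = [components_for_mass(s) for s in samples]
    print(f"eta = {eta:>4}: components covering 90% of mass per sample -> {counts}")
```

With η = 0.1 the mass typically concentrates on one or two components, while with η = 10 nearly all ten components are needed, which is exactly the sparsity behavior exploited above.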
3.2 Hierarchical Dirichlet Processes

HDP is a nonparametric hierarchical Bayesian model which can automatically decide the number of topics. Fig. 3 shows the graphical model proposed in [13]. A global random probability measure G_0 is distributed as a Dirichlet process (DP) [5] with concentration parameter γ and base probability measure H. For each document d, a probability measure G_d is drawn from another Dirichlet process with concentration parameter α_0 and base probability measure G_0, where:

    G_0 \mid \gamma, H \sim DP(\gamma, H)    (2)

    G_d \mid \alpha_0, G_0 \sim DP(\alpha_0, G_0).    (3)

Figure 3: HDP graphical model

The Chinese restaurant franchise is a good metaphor for HDP. Assume there is a restaurant franchise holding a global menu of dishes shared across all restaurants. At each table of each restaurant, only one dish is served, selected from the global menu by the first customer who sits there, and it is shared among all customers who sit at the same table. The same dish can be served at multiple tables in multiple restaurants. In the document modeling scenario, each document corresponds to a restaurant, each word corresponds to a customer, and the topics are the dishes on the global shared menu. As a result, using HDP we can finally learn a set of global topics, and each document covers a subset of these topics.

For our particular application, the base measure H is the Dirichlet distribution over the term space, i.e., H ∼ Dirichlet(η). So the scale parameter η is used as the granularity indicator in our experiments.
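As a rough illustration of Equations (2) and (3) — not the inference procedure of [13] that the paper actually relies on — the following Python sketch draws a truncated stick-breaking approximation of G_0 ~ DP(γ, H) with H = Dirichlet(η) over the term space, and then a document-level G_d ~ DP(α_0, G_0) that reuses G_0's atoms; the truncation level K and vocabulary size V are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(concentration, truncation):
    """Truncated stick-breaking weights for a Dirichlet process draw."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    return weights / weights.sum()        # fold the truncated tail back in

V, K = 1000, 50                           # vocabulary size and truncation level (made up)
gamma, alpha0, eta = 1.0, 1.0, 0.125

# G0 ~ DP(gamma, H): global topic weights, with atoms (topics) drawn from H = Dirichlet(eta).
global_weights = stick_breaking(gamma, K)
topics = rng.dirichlet(np.full(V, eta), size=K)

# Gd ~ DP(alpha0, G0): a document-level measure whose atoms are drawn from G0,
# so the document reuses the global topics with its own weights.
doc_sticks = stick_breaking(alpha0, K)
doc_atom_ids = rng.choice(K, size=K, p=global_weights)
doc_mixture = np.zeros(K)
np.add.at(doc_mixture, doc_atom_ids, doc_sticks)   # merge sticks landing on the same topic

print("document uses", np.count_nonzero(doc_mixture > 1e-3), "of", K, "global topics")
```

The point of the sketch is only the sharing structure: every document's mixture is supported on the same set of global topics drawn once from H.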
3.3 Hierarchical clustering on topics

Based on the top-level and down-level topic sets, we use an agglomerative clustering algorithm to build up the interior nodes.

The top-level topics give a raw partition over the topic distributions and can be directly combined to form the root topic node. They can also help supervise the agglomerative clustering process for the low-level topics. So the whole algorithm is divided into three phases:

1. assign all low-level topics to their immediate top-level topics;
2. for each subset of low-level topics indexed under a top-level topic, invoke an agglomerative clustering process;
3. finally, define the root topic node as the combination of the top-level topics.

So the size of the final topic tree is determined by the numbers of topics on the top level and the down level, and we decide the depth of the tree afterward, according to user requirements, by truncating unwanted lower levels.

During the clustering process, a pair of "closest" topics is merged in each iteration. The whole procedure is presented as Algorithm 1. Algorithm 1 uses a "bottom-up" approach instead of a "top-down" approach, because we have no obvious way to split a topic distribution into two sub-topics, while merging two sub-topics into one is much more straightforward.

Algorithm 1: Hierarchical clustering algorithm (HDP-C) from the low-level topics to the top level
 1: learn a low-level topic set from HDP with η = 0.125: M = {tl_1, · · · , tl_|M|}
 2: learn a top-level topic set from HDP with η = 1.0: N = {tu_1, · · · , tu_|N|}
 3: for each tl_i ∈ M, find its "closest" topic tu_j ∈ N:
 4: for i = 1 to |M| do
 5:     tu_j = argmin_{tu_j ∈ N} D(tl_i, tu_j)
 6:     tu_j.childList.add(tl_i)
 7:     tu_j.nchild++
 8: end for
 9: cluster each top-level topic's children in an agglomerative hierarchical clustering style:
10: for i = 1 to |N| do
11:     while tu_i.nchild > 3 do
12:         find the closest pair of children (t_x, t_y)
13:         merge (t_x, t_y) into a new inner topic t_m
14:         tu_i.childList.remove(t_x)
15:         tu_i.childList.remove(t_y)
16:         tu_i.childList.add(t_m)
17:         tu_i.nchild−−
18:     end while
19: end for
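For concreteness, the sketch below re-implements Algorithm 1 in Python from the pseudocode. It is not the authors' code: topics are plain numpy word distributions, D is the symmetric KL divergence chosen in Section 4.4, the three-children stopping rule follows line 11, and the rule that a merged inner topic's distribution is the average of its two children is our own assumption, since the paper does not spell that detail out.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two word distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def hdp_c(low_topics, top_topics, max_children=3, dist=sym_kl):
    """Cluster low-level topics under their closest top-level topics (Algorithm 1 sketch).

    Topics are 1-D arrays over the term space. Returns, for each top-level topic index,
    a list of (tree, distribution) pairs, where `tree` is a nested tuple of low-level
    topic indices describing the merge order.
    """
    # Phase 1: assign every low-level topic to its closest top-level topic.
    children = {j: [] for j in range(len(top_topics))}
    for i, t in enumerate(low_topics):
        j = min(range(len(top_topics)), key=lambda jj: dist(t, top_topics[jj]))
        children[j].append((i, np.asarray(t, dtype=float)))

    # Phase 2: agglomeratively merge each top-level topic's children
    # until at most `max_children` remain (line 11 of the pseudocode).
    for j, nodes in children.items():
        while len(nodes) > max_children:
            pairs = [(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))]
            a, b = min(pairs, key=lambda ab: dist(nodes[ab[0]][1], nodes[ab[1]][1]))
            (ta, da), (tb, db) = nodes[a], nodes[b]
            merged = ((ta, tb), (da + db) / 2.0)   # new inner topic; assumed merge rule
            nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
        children[j] = nodes

    # Phase 3: the root node is simply the combination of all top-level topics.
    return children
```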
4. EXPERIMENT

In this section, we set up golden-line [14] topic trees from hierarchical data collections, test the tree quality with different probability metrics as the distance measure of our clustering, and use hLDA as the baseline to compare the tree quality in terms of Precision, Recall and F-score.

4.1 Data set

To evaluate the quality of the learned trees, we use the Wikipedia (WIKI) dataset from the third Pascal Challenge on Large Scale Hierarchical Text Classification (LSHTC3; http://lshtc.iit.demokritos.gr/LSHTC3 DATASETS) and the Open Directory Project (DMOZ) dataset from the second Pascal Challenge on Large Scale Hierarchical Text Classification (LSHTC2; http://lshtc.iit.demokritos.gr/LSHTC2 datasets). In total, we obtain three datasets from each of these two sources. All these datasets contain a hierarchy file defining the organization of the documents into a hierarchical tree structure; each document is assigned to one or more leaf nodes. The general statistics over these datasets and the JACM one are shown in Table 1. Given the hierarchical relationship, we randomly choose some sub-trees and build their corresponding golden-line topic trees according to the term frequencies from the documents' assignments.

Table 1: General statistics on the data sets
Dataset  rootId    #Docs  TermSpace  #Golden Topics
JACM     N/A       536    1809       N/A
wiki1    2020309   748    7032       36
wiki2    2337326   420    7047       46
wiki3    2319090   310    5516       29
dmoz1    38825     781    15082      68
dmoz2    39220     289    6290       16
dmoz3    38997     625    13907      87

4.2 Scale effect of η settings on HDP

The scale parameter η is used as the granularity indicator in our experiment. Fig. 4 shows how the number of topics learned by HDP on our datasets changes under different η settings. Fig. 5 shows the distribution variance of the topic collection learned by HDP on each dataset. For each η setting, the inner variance of the learned topic collection is measured by averaging the symmetric KL divergence between every topic and the centroid distribution of that collection. As shown, the variances drop almost consistently as η ranges from 0.1 to 1.0. This observation is consistent with the intended role of η.

Figure 4: Topic counts estimated by HDP vs. η settings ranging in [0.1, 1]

Figure 5: Average symmetric KL distance between learned topics under different Dirichlet scale parameter settings

4.3 Evaluation method

Given the learned topic tree and the golden-line topic tree, we want to measure how close these two structures are with a quantitative metric. We use the ontology evaluation method proposed in [14], which can capture, quite accurately, the deviations of the learned structure from the gold line by means of ontology alignment techniques. In this method, the ontology concepts are defined as vector space representations, which is the same as ours. We summarize the method as follows:

1. set up a one-to-one matching collection M = {1, · · · , |M|} based on the dissimilarity measure between nodes of the learned tree L = {tl_1, · · · , tl_|L|} and nodes of the golden tree G = {tg_1, · · · , tg_|G|}, where |M| = min(|G|, |L|);
2. for each matching m = (tl_i, tg_j), compute the Probabilistic Cotopy Precision (PCP_m) and the Probabilistic Cotopy Recall (PCR_m);
3. take a weighted average of the PCP and PCR values to compute P, R and the F-score, where the weight is the similarity between the nodes of the matching pair.

The corresponding formulas needed for steps 2 and 3 above are the following:

    PCP_m = \frac{|CS(tl_i) \cap CS(tg_j)|}{|CS(tl_i)|}    (4)

    PCR_m = \frac{|CS(tl_i) \cap CS(tg_j)|}{|CS(tg_j)|}    (5)

    TVD = \frac{1}{2} \sum_i |p(i) - q(i)|, \quad TVD \in [0, 1]    (6)

    P = \frac{1}{|M|} \sum_{m=1}^{|M|} (1 - TVD_m) \, PCP_m    (7)

    R = \frac{1}{|M|} \sum_{m=1}^{|M|} (1 - TVD_m) \, PCR_m    (8)

    F = \frac{P \cdot R}{P + R}    (9)

In the above equations, CS(t) is the Cotopy Set of node t, which includes all of its direct and indirect super- and sub-topics and itself. TVD is the Total Variational Distance [7]; it is used to measure the dissimilarity of two probability distributions.
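As a worked toy example of Equations (4)-(9) — with made-up node names, cotopy sets and distributions, not data from the paper's datasets — the following snippet computes PCP, PCR and the weighted P, R and F for a two-pair matching:

```python
import numpy as np

def pcp(cs_l, cs_g):
    return len(cs_l & cs_g) / len(cs_l)

def pcr(cs_l, cs_g):
    return len(cs_l & cs_g) / len(cs_g)

def tvd(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# A hypothetical matching of two learned nodes to two gold nodes.
# Each entry: (cotopy set of learned node, cotopy set of gold node, their word distributions).
matching = [
    ({"root", "db", "txn"}, {"root", "db", "txn", "schema"}, ([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])),
    ({"root", "sys"},       {"root", "sys"},                 ([0.6, 0.2, 0.2], [0.6, 0.3, 0.1])),
]

P = np.mean([(1 - tvd(p, q)) * pcp(cl, cg) for cl, cg, (p, q) in matching])
R = np.mean([(1 - tvd(p, q)) * pcr(cl, cg) for cl, cg, (p, q) in matching])
F = P * R / (P + R)
print(f"P = {P:.3f}, R = {R:.3f}, F = {F:.3f}")
```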
4.4 Probability metrics for the hierarchical clustering distance measure

Gibbs et al. [6] reviewed ten of the most popular probability metrics/distances used by statisticians and probabilists. We choose four of these and test their influence on the quality of our learned trees. The selected metrics are the symmetric KL divergence (D_skl), the Hellinger distance (D_h) and the symmetric χ² distance (D_sχ²), whose definitions are given below, as well as the TVD (D_tvd) defined in Equation (6).

    KL(p, q) = \sum_{i=1}^{V} p_i \log \frac{p_i}{q_i}

    D_{skl}(p, q) = \frac{1}{2} [KL(p, q) + KL(q, p)]    (10)

    D_h(p, q) = \Big( \sum_{i=1}^{V} (\sqrt{p_i} - \sqrt{q_i})^2 \Big)^{1/2}    (11)

    D_{\chi^2}(p, q) = \sum_{i=1}^{V} \frac{(p_i - q_i)^2}{q_i}

    D_{s\chi^2}(p, q) = \frac{1}{2} [D_{\chi^2}(p, q) + D_{\chi^2}(q, p)]    (12)

Fig. 6 plots the F-scores of our learned topic trees on all the datasets with the different distance measures. As observed from this figure, all choices perform similarly, so we choose the symmetric KL divergence as the distance measure of the clustering in the further experiments. The relationship between those measures can be found in [6].

Figure 6: F-scores of the learned trees with different probability metrics as the hierarchical clustering distance measure
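For reference, Equations (6) and (10)-(12) translate directly into code; the snippet below is a straightforward transcription, with a small epsilon guard against zero probabilities (a detail the paper does not discuss) and made-up example distributions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return np.sum(p * np.log(p / q))

def d_skl(p, q):                      # Equation (10)
    return 0.5 * (kl(p, q) + kl(q, p))

def d_hellinger(p, q):                # Equation (11)
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def chi2(p, q, eps=1e-12):
    p, q = np.asarray(p), np.asarray(q) + eps
    return np.sum((p - q) ** 2 / q)

def d_schi2(p, q):                    # Equation (12)
    return 0.5 * (chi2(p, q) + chi2(q, p))

def d_tvd(p, q):                      # Equation (6)
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
for name, d in [("sKL", d_skl), ("Hellinger", d_hellinger), ("sChi2", d_schi2), ("TVD", d_tvd)]:
    print(f"{name:9s} {d(p, q):.4f}")
```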
4.5 Performance of HDP-C

We use hLDA as the baseline, learning a 4-depth topic tree with the scale parameter settings η = 1.0, η = 0.5, η = 0.25 and η = 0.125 for the four levels. This is then compared to the top 4-depth sub-tree of the tree learned by HDP-C. To be consistent with hLDA's η settings, the top-level topics are learned with η = 1.0 and the down-level topics with η = 0.125. We use the default values for the other parameters: γ = 1.0, α_0 = 1.0, and a maximum of 1000 iterations. For hLDA, we set the maximum number of iterations to 2000 because it has a bigger learning space than HDP.

Table 2: Comparison of tree quality with the baseline
                 hLDA                    HDP-C                 Enhancement (%)
           P      R      F         P      R      F          P      R      F
wiki1    0.323  0.459  0.189     0.346  0.464  0.198       7.1    1.1    4.8
wiki2    0.333  0.461  0.193     0.401  0.460  0.214      20.4   -0.2   10.9
wiki3    0.335  0.532  0.205     0.432  0.574  0.247      29.0    7.9   20.5
dmoz1    0.350  0.449  0.197     0.437  0.432  0.217      24.9   -3.8   10.2
dmoz2    0.234  0.472  0.156     0.389  0.497  0.218      66.2    5.3   39.7
dmoz3    0.302  0.364  0.165     0.445  0.392  0.208      47.4    7.7   26.1

The evaluation results are given in Table 2 (note that the JACM dataset is not included here due to the lack of a golden-line topic tree). In terms of the F-score, our approach performs, on average, 12.1% better on the wiki datasets and 25.3% better on the dmoz datasets. One reason is that in hLDA each document only spans a single path from the root to a leaf node, which is quite a tough restriction on the mixture of topics for each document. In contrast, our approach does not place any prior restriction on each document's topic choice; each document can span an arbitrary sub-tree, which better reflects its generative nature.

Besides, we observe from Table 2 that the improvement in terms of P is much larger than that in terms of R, which indicates that our approach is more preferable for tasks that care more about precision.

5. CONCLUSIONS

This paper builds a semantic topic tree representation for a document collection based on a non-parametric Bayesian topic model. Only the up-bound and low-bound topic sets are directly inferred, by tuning the topics' Dirichlet scale parameter for different levels of abstraction. A hierarchical clustering algorithm (HDP-C) is proposed to derive the middle-level topics in order to construct the final topic tree. Our experimental study on several real-world datasets shows the competitive performance of our approach.

6. ACKNOWLEDGMENTS

The work described in this paper has been supported by the NSFC Overseas, Hong Kong & Macao Scholars Collaborated Researching Fund (61028003) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (20090141120050).

7. REFERENCES

[1] D. M. Blei. Introduction to probabilistic topic models. Communications of the ACM, 2011.
[2] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57:7:1–7:30, February 2010.
[3] D. M. Blei and J. D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
[6] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
[7] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
[8] T. Hofmann. The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Volume 2, pages 682–687, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[9] W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In UAI 2007.
[10] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, pages 577–584, 2006.
[11] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pages 633–640, 2007.
[12] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.
[13] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[14] E. Zavitsanos, G. Paliouras, and G. Vouros. Gold standard evaluation of ontology learning methods through ontology transformation and alignment. IEEE Transactions on Knowledge and Data Engineering, 23(11):1635–1648, November 2011.
[15] E. Zavitsanos, G. Paliouras, and G. A. Vouros. Non-parametric estimation of topic hierarchies from texts with hierarchical Dirichlet processes. Journal of Machine Learning Research, 12:2749–2775, November 2011.
[16] E. Zavitsanos, S. Petridis, G. Paliouras, and G. A. Vouros. Determining automatically the size of learned ontologies. In Proceedings of ECAI 2008: 18th European Conference on Artificial Intelligence, pages 775–776, Amsterdam, The Netherlands, 2008. IOS Press.