<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Clustering on HDP Topics to build a Semantic Tree from Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jianfeng Si</string-name>
          <email>jianfsi2@student.cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qing Li</string-name>
          <email>itqli@cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tieyun Qian</string-name>
          <email>qty@whu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaotie Deng</string-name>
          <email>csdeng@cityu.edu.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, City University of Hong Kong</institution>
          ,
          <addr-line>Hong Kong</addr-line>
          ,
          <country>China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, City University of Hong Kong</institution>
          ,
          <addr-line>Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Lab of Software, Eng., Wuhan University</institution>
          ,
          <addr-line>Wuhan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, and topics residing at different node levels of the tree should exhibit different levels of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node directly, which is a quite time-consuming task, our approach is based on a nonparametric Bayesian topic model, namely, Hierarchical Dirichlet Processes (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets of different levels of abstraction are learned from the HDP separately and further integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the hierarchical clustering process, the lower level of specific topics is clustered into a higher level of more general topics in an agglomerative style to obtain the final topic tree. Evaluation of the tree quality on several real-world datasets demonstrates its competitive performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The ever-increasing explosion of online unstructured
information puts forward a strong demand to organize online
resources in a more e cient way. Hierarchical structures are
widely used in knowledge representation, resource
organization or document indexing. For example, the web
directories organize web pages into a hierarchical tree, providing a
comprehensive navigation tool. The discovery of such rich
semantic hierarchies from raw data collections becomes a
fundamental research in data analysis.</p>
      <p>In this paper, we aim to learn a semantic representation
of a text corpus in the form of a topic tree structure. This can
be regarded as a kind of high-level summarization of the
content of any document collection, as a topic tree expresses
a shared conceptualization of interests in a certain domain.
Such a topic tree functions as an outline to help readers
grasp the main idea of the document collection, much like
the table of contents (TOC) of a printed book.</p>
      <p>
        Instead of learning every node directly, which is a quite
time-consuming task, we treat the construction process
of the topic hierarchy mainly as a two-phase task: 1) the
identification or definition of topics; 2) the derivation of
hierarchical relationships between or among the topics. Our
approach is built on a nonparametric Bayesian topic model,
namely, Hierarchical Dirichlet Processes (HDP) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. By
tuning the topics' Dirichlet scale parameter settings, two
topic sets are learned from the HDP separately with
different levels of semantic abstraction: one as the top level,
which represents a small collection of general topics, and
another as the down level, which corresponds to a relatively
larger collection of specific topics. Topics from the two
different sets exhibit different topic granularity in semantic
representation. Based on these, we can efficiently construct
the "middle" level topics directly without modeling them
explicitly. As a result, the hierarchical structure comes out
straightforwardly and the whole learning process speeds up.
      </p>
      <p>Fig.1 shows a sub-tree of our learned topic tree on the
JACM1 dataset, which contains 536 abstracts of the Journal
of the ACM from 1987-2004. There are two super topics
in this sub-tree, one a system-related topic and the other
a database-related topic. When we look into the database
topic, we find that it is further divided into 3 specific aspects,
namely "Scheme Design", "DB Robust" and "Transaction
Control". We also observe that the super topic mainly
contains some widely used function words or stop words,
resulting in the most "general" topic as the root.</p>
      <p>The organization of the paper is as follows. In Section 2
we briefly introduce the related work. We define in Section
3 our problem formulation and propose the HDP-C model.
Our experiments on several real-world datasets are presented
in Section 4, and we conclude our work in Section 5.
1http://www.cs.princeton.edu/~blei/downloads/jacm.tgz
 </p>
      <p>[Figure 1: A sub-tree of the learned topic tree on the JACM dataset. The root topic (function words such as "the", "of", "a") covers a system-related super topic (with specific topics on consensus and fault tolerance, asynchronous protocols, network broadcast, and shared atomic registers) and a database-related super topic (with specific topics on database schemes, data models and transactions, availability and replication, and transaction/concurrency control).]</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Topic modeling, as a task of document modeling, has
attracted much attention in recent years, but much work just
focuses on inferring hidden topics as a flat cluster over the
term space [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the basic and also the most widely
used ones is the Latent Dirichlet Allocation(LDA)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. LDA
can learn a predefined number of topics under a bag-of-topics
assumption. A question that comes with LDA is how many topics
we should ask the model to estimate. To address this
problem, another nonparametric Bayesian model, namely,
Hierarchical Dirichlet Processes (HDP), was introduced and
adopted [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In both LDA and HDP, no relationship is
defined explicitly between topics during learning, and the
estimated topics are "flat".
      </p>
      <p>
        Compared to the "flat" models, hierarchical modeling of
topics can learn more accurate and predictive models,
because hierarchical modeling is more likely to capture the
generative nature of text collections ([
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). The
Correlated Topic Model (CTM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an extension of LDA,
captures the relations between topics, but it only models
pair-wise correlations. The Pachinko Allocation Model
(PAM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] uses a directed acyclic graph (DAG) structure to
learn and represent topic correlations. PAM connects
words and topics in a DAG, where topics reside on the
interior nodes and words reside on the leaf nodes, but PAM
is unable to represent word distributions as parents of other
word distributions. Zavitsanos et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] learned a topic tree
using an LDA for each level, starting with one topic at level
0 and incrementing the number of topics in each further
iteration/level, and used the symmetric KL divergence between
neighboring hierarchies to indicate convergence. The
basis of their work is quite similar to ours, but they learn a
predefined number of topics for each level explicitly. The
Hierarchical Latent Dirichlet Allocation (hLDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the first
model to learn a tree-structured topic distribution. In hLDA,
each document spans a single path from the root
node to a leaf node of a tree with a predefined depth; the
words of that document are then generated via the topics on that
path. This model arranges the topics into a tree, with the
desideratum that more general topics should appear near
the root and more specialized topics should appear near the
leaves [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Hierarchical HDP(hHDP)[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], on the other hand, learns a
topic hierarchy from text by defining an HDP for each level
of the topic hierarchy. The topic hierarchy is learned in a
bottom-up fashion: starting from the document corpus, the
leaf topics are inferred first; then the word distributions of
all leaf topics make up the observations for estimating
the next level up. The procedure repeats until the root
topic is inferred. In hHDP, the parent/child relationships
between upper and lower topics are not clearly identified. Also,
this recursive definition of HDP is likely to suffer from
low time efficiency.
      </p>
      <p>
        Our work is also built on HDP, but only for the
root-level and lowest-level topics. We construct the interior
level topics by a simple clustering algorithm which is quite
efficient, and an evaluation of the final tree quality also
demonstrates its competitive performance. Different from
traditional hierarchical clustering, which gives a
hierarchical partition of documents [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the points in our hierarchical
clustering are word distributions.
      </p>
      <p>
        Evaluation of the learned tree is also a relevant and
interesting topic. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], an ontology evaluation method is
proposed, and we adopt the same evaluation method in our
work due to its close relevance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. HDP-C MODEL</title>
      <p>In this section, we first analyze the impact of the scale
parameter of the Dirichlet distribution, then introduce HDP
briefly, followed by a detailed description of our clustering
algorithm (HDP-C).</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Dirichlet distribution and its scale parameter</title>
      <p>Our HDP-C model is built upon the HDP; the idea
is to tune the topics' Dirichlet scale parameter settings
so as to control the granularity of the topics used to model
the text content. The Dirichlet distribution is a
multi-parameter generalization of the Beta distribution, and
it defines a distribution over distributions, i.e., the samples
from a Dirichlet are distributions on some discrete
probability space. The Dirichlet is in the exponential family and
is a conjugate prior to the parameters of the multinomial
distribution, which facilitates inference and parameter
estimation.</p>
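      <p>As a concrete instance of this conjugacy (a standard fact, stated here for completeness): if n = (n_1, ..., n_k) are the counts of a multinomial sample drawn from θ, the posterior is again Dirichlet, p(θ | n, α) = Dirichlet(α_1 + n_1, ..., α_k + n_k), so inference amounts to adding the observed counts to the prior pseudo-counts α_i.</p>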
      <p>Let θ be a k-dimensional Dirichlet random variable with θ_i ≥ 0 and Σ_{i=1}^{k} θ_i = 1; it lies in the (k-1)-dimensional probability simplex with the following probability density:</p>
      <p>p(θ | α) = ( Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i) ) Π_{i=1}^{k} θ_i^{α_i - 1}   (1)</p>
      <p>where the parameter α is a k-vector with components α_i &gt; 0, and α_i can be interpreted as "prior observation counts" for
events governed by θ_i. Furthermore, α_0 = Σ_i α_i is called
the scale or concentration parameter with the base measure
(α_1/α_0, ..., α_k/α_0), and Γ(x) is the Gamma function.</p>
      <p>A frequently used special case is the symmetric
Dirichlet distribution, where α_1 = ... = α_k = η, indicating that
we have no prior knowledge of which components are more
favorable, and as a result we use a uniform base
measure. The scale parameter η plays an important role in
controlling the variance and sparsity of the samples. For
example, when η = 1, the symmetric Dirichlet distribution is
equivalent to a uniform distribution over the (k-1)-dimensional
probability simplex, i.e., it is uniform over all points in its support.
Values of the scale parameter above 1 favor variates that
are dense, evenly distributed distributions, i.e., all
probabilities returned are similar to each other. Values of the scale
parameter below 1 favor sparse distributions, i.e., most of
the probabilities returned will be close to 0, and the vast
majority of the mass will be concentrated on a few of the
components.</p>
      <p>Fig.2 depicts five samples for each different setting (η =
0.1, η = 1, η = 10) from a 10-dimensional Dirichlet
distribution. Clearly, η = 0.1 leads to samples that bias the
probability mass toward a few components of the sampled
multinomial distribution; η = 1 leads to a uniform distribution
over the simplex; and η = 10 leads to a situation where all samples
are close to each other (in other words, each component gets
similar probability mass).</p>
      <p>In a word, a smaller η setting encourages fewer words to
have high probability mass in each topic; thus, the posterior
requires more topics to explain the data, and as a result we get
relatively more specific topics. Based on this characteristic,
we can obtain two topic sets with different
granularity, corresponding to the up-bound and low-bound
topic sets in the sense of granularity.</p>
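      <p>To make the effect of η concrete, the following sketch (an illustration assuming NumPy is available; the dimensionality and sample count mirror Fig.2 rather than our actual experiments) draws symmetric Dirichlet samples under the three settings and reports how much mass falls on the single largest component.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
k = 10  # dimensionality, as in Fig. 2

for eta in (0.1, 1.0, 10.0):
    # five draws from a symmetric Dirichlet with scale parameter eta
    samples = rng.dirichlet(np.full(k, eta), size=5)
    top_mass = samples.max(axis=1)  # mass on the largest component of each draw
    print("eta =", eta, " mean largest-component mass =", round(top_mass.mean(), 2))

# Small eta yields sparse draws (most mass on a few components);
# large eta yields near-uniform draws (every component gets similar mass).
</preformat>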
    </sec>
    <sec id="sec-5">
      <title>3.2 Hierarchical Dirichlet Processes</title>
      <p>
        HDP is a nonparametric hierarchical Bayesian model which
can automatically decide the number of topics. Fig.3 shows
the graphical model proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. A global random
probability measure G_0 is distributed as a Dirichlet
process (DP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with concentration parameter γ and base
probability measure H. For each document d, a probability
measure G_d is drawn from another Dirichlet process with
concentration parameter α_0 and base probability measure
G_0, where:
      </p>
      <p>G_0 | γ, H ~ DP(γ, H)   (2)</p>
      <p>G_d | α_0, G_0 ~ DP(α_0, G_0)   (3)</p>
      <p>The Chinese restaurant franchise is a good metaphor for
HDP. Assume there is a restaurant franchise holding a
globally shared menu of dishes across all restaurants. At each table
of each restaurant, only one dish is served from the global
menu, selected by the first customer who sits there, and it
is shared among all customers who sit at the same table.
The same dish can be served at multiple tables in multiple
restaurants. In the document modeling scenario, each
document corresponds to a restaurant, each word corresponds
to a customer, and the topics are the dishes of the globally
shared menu. As a result, using HDP, we can finally learn a set of
global topics, and each document covers a subset of
the topics.</p>
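      <p>The following truncated stick-breaking sketch illustrates this two-level generative process (an illustration only, not our inference code; the truncation level, vocabulary size and parameter values are arbitrary assumptions).</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(1)
V, T = 1000, 50              # vocabulary size and truncation level (illustrative)
gamma, alpha0, eta = 1.0, 1.0, 0.125

# Base measure H = Dirichlet(eta): global topics phi_k over the vocabulary.
phi = rng.dirichlet(np.full(V, eta), size=T)

# G_0 ~ DP(gamma, H): global topic weights beta via stick-breaking.
v = rng.beta(1.0, gamma, size=T)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()           # renormalise the truncated sticks

# G_d ~ DP(alpha0, G_0): each document reweights the shared topics.
pi_d = rng.dirichlet(alpha0 * beta + 1e-9)

# Generate a short document: pick a topic per word, then a word from that topic.
topic_of_word = rng.choice(T, size=20, p=pi_d)
words = [rng.choice(V, p=phi[z]) for z in topic_of_word]
</preformat>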
      <p>For our particular application, the base measure H is the
Dirichlet distribution over the term space, i.e., H ~ Dirichlet(η).
So, the scale parameter η is used as the granularity indicator
in our experiment.</p>
      <p>Algorithm 1: Hierarchical clustering algorithm (HDP-C) from low-level topics to the top level</p>
      <preformat>
 1: learn a low-level topic set from HDP with η = 0.125: M = {tl_1, ..., tl_|M|}
 2: learn a top-level topic set from HDP with η = 1.0:   N = {tu_1, ..., tu_|N|}
 3: for each tl_i ∈ M, find its "closest" topic tu_j ∈ N:
 4: for i = 1 to |M| step 1 do
 5:     tu_j = argmin_{tu_j ∈ N} D(tl_i, tu_j)
 6:     tu_j.childList.add(tl_i)
 7:     tu_j.nchild++
 8: end for
 9: cluster the top-level topics' children in an agglomerative hierarchical clustering style:
10: for i = 1 to |N| step 1 do
11:     while tu_i.nchild &gt; 3 do
12:         find the closest children pair (t_x, t_y)
13:         merge (t_x, t_y) into a new inner topic t_m
14:         tu_i.childList.remove(t_x)
15:         tu_i.childList.remove(t_y)
16:         tu_i.childList.add(t_m)
17:         tu_i.nchild--
18:     end while
19: end for
</preformat>
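      <p>A compact Python sketch of Algorithm 1 follows (our own paraphrase, not a reference implementation: topics are word-distribution vectors, the distance function D is supplied by the caller, and the rules for the merged distributions and the root distribution are illustrative assumptions).</p>
      <preformat>
import numpy as np

def make_node(dist, children=()):
    """A tree node holding a word distribution and a list of child nodes."""
    return {"dist": np.asarray(dist, dtype=float), "children": list(children)}

def hdp_clustering(low_topics, top_topics, D, max_children=3):
    """HDP-C sketch: attach each low-level topic to its closest top-level topic,
    then agglomeratively merge each top-level topic's children until at most
    max_children remain, and finally combine the top-level topics into a root."""
    tops = [make_node(t) for t in top_topics]

    # Phase 1: assign every low-level topic to its closest top-level topic.
    for t in low_topics:
        closest = min(tops, key=lambda u: D(np.asarray(t, dtype=float), u["dist"]))
        closest["children"].append(make_node(t))

    # Phase 2: agglomerative merging under each top-level topic.
    for u in tops:
        kids = u["children"]
        while len(kids) &gt; max_children:
            pairs = [(i, j) for i in range(len(kids)) for j in range(i + 1, len(kids))]
            i, j = min(pairs, key=lambda p: D(kids[p[0]]["dist"], kids[p[1]]["dist"]))
            merged = make_node((kids[i]["dist"] + kids[j]["dist"]) / 2.0,  # assumed merge rule
                               children=[kids[i], kids[j]])
            kids = [k for idx, k in enumerate(kids) if idx not in (i, j)] + [merged]
            u["children"] = kids

    # Phase 3: the root combines all top-level topics (assumed combination rule).
    root_dist = np.mean([u["dist"] for u in tops], axis=0)
    return make_node(root_dist, children=tops)
</preformat>
      <p>With D set to the symmetric KL divergence chosen in Section 4, this reproduces the control flow of Algorithm 1; only the averaging of merged and root distributions is our illustrative assumption.</p>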
    </sec>
    <sec id="sec-6">
      <title>3.3 Hierarchical clustering on topics</title>
      <p>Based on the top-level and down-level topic sets, we
use an agglomerative clustering algorithm to build up the
interior nodes.</p>
      <p>The top-level topics give a raw partition of the topic
distribution and can be directly combined to form the root
topic node. They can also help supervise the agglomerative
clustering process for the low-level topics. So, the whole
algorithm is divided into three phases:
1. assign all low-level topics to their immediate top-level
topics;
2. for each subset of low-level topics indexed under a
top-level topic, invoke an agglomerative clustering process;
3. finally, define the root topic node as a combination of the
top-level topics.</p>
      <p>So, the size of the final topic tree is determined by the
number of topics at the top level and the down level, and we decide
the depth of the tree afterward, according to user
requirements, by truncating unwanted lower levels.</p>
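      <p>A minimal sketch of this truncation step (reusing the dictionary node representation from the HDP-C sketch above, which is itself an assumption):</p>
      <preformat>
def truncate(node, depth):
    """Keep only the top `depth` levels of a topic tree built by hdp_clustering;
    deeper nodes are simply dropped, per the user's required tree depth."""
    if depth == 1:
        return {"dist": node["dist"], "children": []}
    return {"dist": node["dist"],
            "children": [truncate(c, depth - 1) for c in node["children"]]}
</preformat>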
      <p>During the clustering process, a pair of "closest" topics is
merged in each iteration. The whole algorithm is presented
as Algorithm 1.</p>
      <p>Algorithm 1 uses a "bottom-up" approach instead of a "top-down"
approach, because we have no clear way to
split a topic distribution into two sub-topics, while merging
two sub-topics into one is much more straightforward.</p>
    </sec>
    <sec id="sec-7">
      <title>4. EXPERIMENT</title>
      <p>
        In this section, we set up our golden line[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] topic trees
from hierarchical data collections, test the tree quality with
different distance measures, and compare the performance
of HDP-C against hLDA.
      </p>
      <p>[Figure 2: Five samples each from a 10-dimensional symmetric Dirichlet distribution under η = 0.1, η = 1 and η = 10; component probabilities range from 0 to 1.]</p>
      <p>To evaluate the learned tree quality, we use the Wikipedia
(WIKI) dataset from the third Pascal Challenge on Large
Scale Hierarchical Text Classification (LSHTC3)2 and the
Open Directory Project (DMOZ) dataset from the second
Pascal Challenge on Large Scale Hierarchical Text Classification
(LSHTC2)3. In total, we obtain three datasets from each of
these two sources. All these datasets contain a hierarchy
file defining the organization of the documents into a
hierarchical tree structure. Each document is assigned to one or
more leaf nodes. The general statistics of these datasets
and the JACM one are shown in Table 1.</p>
      <p>Given the hierarchical relationship, we randomly choose
some sub-trees from it and build their corresponding golden-line
topic trees according to the term frequencies of the
documents assigned to each node, as sketched below.</p>
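      <p>The following sketch reflects our reading of this construction (each chosen hierarchy node becomes a term-frequency distribution aggregated from the documents assigned to it and to its descendants; the data-structure and helper names are hypothetical).</p>
      <preformat>
from collections import Counter

def golden_line_tree(children, doc_terms, doc_assignments):
    """Sketch: build a term-frequency distribution for every node of a chosen
    sub-tree by aggregating the documents assigned to it and to its descendants.

    children:        maps each node to its list of child nodes (the chosen sub-tree)
    doc_terms:       maps each doc id to the list of terms occurring in that document
    doc_assignments: maps each node to the list of doc ids assigned to it
    """
    dists = {}

    def build(node):
        counts = Counter()
        for doc in doc_assignments.get(node, []):
            counts.update(doc_terms[doc])
        for child in children.get(node, []):
            counts.update(build(child))          # fold in the descendants' terms
        total = sum(counts.values()) or 1
        dists[node] = {w: c / total for w, c in counts.items()}
        return counts

    roots = set(children) - {c for kids in children.values() for c in kids}
    for root in roots:
        build(root)
    return dists
</preformat>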
    </sec>
    <sec id="sec-8">
      <title>4.2 Scale effect of η settings on HDP</title>
      <p>The scale parameter η is used as the granularity
indicator in our experiment. Fig.4 shows how the number of topics
learned from HDP on our datasets changes under different
η settings. Fig.5 shows the distribution variances of the
topic collections learned from HDP on our datasets. For each η
setting, the inner variance of the learned topic collection
is measured as the average symmetric KL
divergence between every topic and the centroid distribution
of that collection. As shown, the variances drop almost
consistently as η ranges from 0.1 to 1.0. This observation
is consistent with the interpretation of η discussed above.
2http://lshtc.iit.demokritos.gr/LSHTC3_DATASETS
3http://lshtc.iit.demokritos.gr/LSHTC2_datasets</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Evaluation method</title>
      <p>
        Given the learned topic tree and the golden-line topic tree,
we want to measure how close these two structures are with a
quantitative metric. We use the ontology evaluation method
proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which can capture, quite accurately, the
deviations of the learned structure from the golden line by means of
ontology alignment techniques. In this method, the ontology
concepts are defined as vector space representations, which
is the same as in our setting. We summarize this method as follows:
1. set up a one-to-one matching collection M = {m_1, ..., m_|M|}
based on the dissimilarity measure between nodes of
the learned tree L = {tl_1, ..., tl_|L|} and nodes of the
golden tree G = {tg_1, ..., tg_|G|}, where
|M| = min(|G|, |L|) (a matching sketch follows this list);
2. for each matching m = (tl_i, tg_j), compute the
Probabilistic Cotopy Precision (PCP_m) and Probabilistic
Cotopy Recall (PCR_m);
3. take a weighted average of the PCP and PCR values to compute
P, R and the F-score, where the weight is the similarity
between the nodes of the matching pair.
      </p>
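      <p>A greedy sketch of step 1 is shown below (the exact alignment strategy of [14] may differ; this is our assumption, and the function and argument names are hypothetical).</p>
      <preformat>
def one_to_one_matching(learned, golden, D):
    """Greedy one-to-one matching sketch.  `learned` and `golden` map node ids
    to word distributions, and D is a dissimilarity measure between them."""
    pairs = sorted(
        ((D(ld, gd), li, gi)
         for li, ld in learned.items()
         for gi, gd in golden.items()),
        key=lambda x: x[0],
    )
    used_l, used_g, matching = set(), set(), []
    for dissim, li, gi in pairs:
        if li not in used_l and gi not in used_g:
            matching.append((li, gi, dissim))
            used_l.add(li)
            used_g.add(gi)
    return matching  # |matching| = min(|learned|, |golden|)
</preformat>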
      <p>The corresponding formulas needed for steps 2 and 3 above
are shown below:</p>
      <p>PCP_m = |CS(tl_i) ∩ CS(tg_j)| / |CS(tl_i)|   (4)</p>
      <p>PCR_m = |CS(tl_i) ∩ CS(tg_j)| / |CS(tg_j)|   (5)</p>
      <p>TVD = (1/2) Σ_i |p(i) - q(i)|,   TVD ∈ [0, 1]   (6)</p>
      <p>P = (1/|M|) Σ_{m=1}^{|M|} (1 - TVD_m) PCP_m   (7)</p>
      <p>R = (1/|M|) Σ_{m=1}^{|M|} (1 - TVD_m) PCR_m   (8)</p>
      <p>F = 2PR / (P + R)   (9)</p>
      <p>In the above equations, CS(t) is the cotopy set of node t,
which includes all its direct and indirect super- and sub-topics.</p>
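      <p>A direct transcription of equations (6)-(9) (the per-match PCP_m and PCR_m of equations (4)-(5) are assumed to be computed from the cotopy sets beforehand; the F-score uses the standard harmonic-mean form).</p>
      <preformat>
def tvd(p, q):
    """Total variation distance between two distributions over the same
    vocabulary (Equation 6); p and q are dicts mapping word to probability."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)

def weighted_prf(matches):
    """Equations (7)-(9).  `matches` is a list of (pcp_m, pcr_m, tvd_m) triples,
    one per matched node pair."""
    n = len(matches)
    P = sum((1.0 - t) * pcp for pcp, _, t in matches) / n
    R = sum((1.0 - t) * pcr for _, pcr, t in matches) / n
    F = 2 * P * R / (P + R)   # standard harmonic-mean F-score
    return P, R, F
</preformat>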
      <p>
        Gibbs et al.[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] reviewed ten of the most popular
probability metrics/distances used by statisticians and
probabilists. We choose four of these and test their influence on
the quality of our learned trees. The selected metrics are the symmetric
KL divergence (D_skl), the Hellinger distance (D_h), and the
symmetric χ² distance (D_sχ²), whose definitions are given below, as
well as the TVD (D_tvd) defined in Equation (6).
      </p>
      <p>D_skl(p, q) = (1/2) [KL(p, q) + KL(q, p)],   KL(p, q) = Σ_{i=1}^{V} p_i log(p_i / q_i)</p>
      <p>D_h(p, q) = ( Σ_{i=1}^{V} (√p_i - √q_i)² )^{1/2}</p>
      <p>D_sχ²(p, q) = (1/2) [D_χ²(p, q) + D_χ²(q, p)],   D_χ²(p, q) = Σ_{i=1}^{V} (p_i - q_i)² / q_i</p>
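      <p>A small sketch of these three distances follows the definitions above (D_tvd appears in the sketch after equations (6)-(9); the distributions are assumed strictly positive, e.g. smoothed topic-word distributions).</p>
      <preformat>
import numpy as np

def sym_kl(p, q):
    """Symmetric KL divergence D_skl; p and q are NumPy arrays over the vocabulary,
    assumed strictly positive."""
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))

def hellinger(p, q):
    """Hellinger distance D_h, following the definition above."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def sym_chi2(p, q):
    """Symmetric chi-square distance D_schi2."""
    chi2 = lambda a, b: np.sum((a - b) ** 2 / b)
    return 0.5 * (chi2(p, q) + chi2(q, p))
</preformat>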
      <p>
        Fig.6 plots the F-scores of our learned topic trees on all
the datasets with different distance measures. As observed
from this figure, all choices perform similarly, so we choose
the symmetric KL divergence as the distance measure for
clustering in further experiments. The relationship between
those measures can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>4.5 Performance of HDP-C</title>
      <p>We use hLDA as the baseline to learn a 4-depth topic tree
with the scale parameter settings η = 1.0, η = 0.5, η = 0.25,
and η = 0.125 for the respective levels. This is then compared to the
top 4-depth sub-tree of the tree learned through HDP-C. To
be consistent with hLDA's settings, the top-level topics are
learned with η = 1.0 and the down-level topics are learned with
η = 0.125. We use the default values for the other parameters:
γ = 1.0, α_0 = 1.0, and the maximum number of iterations is 1000. For hLDA,
we set the maximum number of iterations to 2000 because it has a
larger learning space than HDP.
The evaluation result is given in Table 2 (note that the JACM
dataset is not included here due to the lack of a golden-line
topic tree). In terms of the F-score, our approach performs,
on average, 12.1% better on the WIKI datasets and 25.3% better
on the DMOZ datasets. One reason is that, in hLDA, each
document only spans a single path from the root to a leaf node,
which is quite a tough restriction on the mixture of topics
for each document. In contrast, our approach does not place
any prior restriction on each document's topic choice:
each document can span an arbitrary sub-tree, which better
explains the generative nature of the data.</p>
      <p>Besides, we observe from Table 2 that the improvement
in terms of P is much larger than that of R, which indicates that
our approach is preferable for tasks that care more about
precision.</p>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSIONS</title>
      <p>This paper builds a semantic topic tree representation of
a document collection based on a non-parametric Bayesian
topic model. Only the up-bound and low-bound topic sets
are directly inferred, by tuning the topics' Dirichlet scale
parameter for different levels of abstraction. A
hierarchical clustering algorithm (HDP-C) is proposed to derive the
middle-level topics and construct the final topic tree.
Our experimental study on several real-world datasets shows
the competitive performance of our approach.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>The work described in this paper has been supported by
the NSFC Overseas, Hong Kong &amp; Macao Scholars
Collaborated Researching Fund (61028003) and the Specialized
Research Fund for the Doctoral Program of Higher Education,
China (20090141120050).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Introduction to probabilistic topic models</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          , T. L. Griffiths, and
          <string-name>
            <surname>M. I. Jordan.</surname>
          </string-name>
          <article-title>The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>57</volume>
          :7:
          <issue>1</issue>
          {7:
          <fpage>30</fpage>
          ,
          <year>February 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A correlated topic model of science</article-title>
          .
          <source>AAS</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>17</volume>
          {
          <fpage>35</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          {
          <fpage>1022</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          .
          <article-title>A Bayesian Analysis of Some Nonparametric Problems</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <volume>209</volume>
          {
          <fpage>230</fpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gibbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>On choosing and bounding probability metrics</article-title>
          .
          <source>Internat. Statist. Rev.</source>
          , pages
          <volume>419</volume>
          {
          <fpage>435</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gibbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>On Choosing and Bounding Probability Metrics</article-title>
          .
          <source>International Statistical Review</source>
          ,
          <volume>70</volume>
          :
          <fpage>419</fpage>
          {
          <fpage>435</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <article-title>The cluster-abstraction model: unsupervised learning of topic hierarchies from text data</article-title>
          .
          <source>In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'99</source>
          , pages
          <fpage>682</fpage>
          {
          <fpage>687</fpage>
          , San Francisco, CA, USA,
          <year>1999</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Nonparametric Bayes Pachinko Allocation</article-title>
          .
          <source>In UAI 07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Pachinko allocation: Dag-structured mixture models of topic correlations</article-title>
          .
          <source>In In Proceedings of the 23rd International Conference on Machine Learning</source>
          , pages
          <volume>577</volume>
          {
          <fpage>584</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mccallum</surname>
          </string-name>
          .
          <article-title>Mixtures of hierarchical topics with pachinko allocation</article-title>
          .
          <source>In In Proceedings of the 24th International Conference on Machine Learning</source>
          , pages
          <volume>633</volume>
          {
          <fpage>640</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          .
          <article-title>A Survey of Recent Advances in Hierarchical Clustering Algorithms</article-title>
          .
          <source>The Computer Journal</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <volume>354</volume>
          {
          <fpage>359</fpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Hierarchical Dirichlet processes</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>101</volume>
          (
          <issue>476</issue>
          ):
          <volume>1566</volume>
          {
          <fpage>1581</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Gold standard evaluation of ontology learning methods through ontology transformation and alignment. Knowledge and Data Engineering</article-title>
          , IEEE Transactions on,
          <volume>23</volume>
          (
          <issue>11</issue>
          ):
          <volume>1635</volume>
          {1648, nov.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Non-parametric estimation of topic hierarchies from texts with hierarchical dirichlet processes</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>999888</volume>
          :
          <fpage>2749</fpage>
          {
          <fpage>2775</fpage>
          ,
          <string-name>
            <surname>Nov</surname>
          </string-name>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zavitsanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          , G. Paliouras, and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Determining automatically the size of learned ontologies</article-title>
          .
          <source>In Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence</source>
          , pages
          <fpage>775</fpage>
          {
          <fpage>776</fpage>
          , Amsterdam, The Netherlands,
          <year>2008</year>
          . IOS Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>