Hierarchical Clustering on HDP Topics to Build a Semantic Tree from Text

Jianfeng Si (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) jianfsi2@student.cityu.edu.hk
Qing Li (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) itqli@cityu.edu.hk
Tieyun Qian (State Key Lab of Software Eng., Wuhan University, Wuhan, China) qty@whu.edu.cn
Xiaotie Deng (Dept. of Computer Science, City University of Hong Kong, Hong Kong, China) csdeng@cityu.edu.hk

VLDS'12, August 31, 2012, Istanbul, Turkey. Copyright 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, in which topics residing at different levels of the tree exhibit different degrees of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node of the tree directly, which is quite time consuming, our approach builds on a nonparametric Bayesian topic model, namely the Hierarchical Dirichlet Process (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets with different levels of abstraction are learned from the HDP separately and then integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the hierarchical clustering process, the lower level of specific topics is clustered into the higher level of more general topics in an agglomerative style to obtain the final topic tree. Evaluation of the tree quality on several real-world datasets demonstrates its competitive performance.

1. INTRODUCTION

The ever-increasing explosion of online unstructured information puts forward a strong demand to organize online resources in a more efficient way. Hierarchical structures are widely used in knowledge representation, resource organization and document indexing. For example, web directories organize web pages into a hierarchical tree, providing a comprehensive navigation tool. The discovery of such rich semantic hierarchies from raw data collections has become a fundamental research problem in data analysis.

In this paper, we aim to learn a semantic representation of a text corpus in the form of a topic tree. Such a tree can be regarded as a high-level summarization of the content of a document collection, as it expresses a shared conceptualization of the interests in a certain domain. A topic tree functions as an outline that helps readers grasp the main ideas of the document collection, much like the table of contents (TOC) of a printed book. Instead of learning every node directly, which is a quite time-consuming task, we treat the construction of the topic hierarchy as a two-phase task: 1) the identification or definition of topics; 2) the derivation of hierarchical relationships between the topics. Our approach is built on a nonparametric Bayesian topic model, namely the Hierarchical Dirichlet Process (HDP) [13]. By tuning the topics' Dirichlet scale parameter, two topic sets with different levels of semantic abstraction are learned from the HDP separately: a top level that represents a small collection of general topics, and a down level that corresponds to a relatively larger collection of specific topics. Topics from the two sets thus exhibit different granularity of semantic representation. Based on these two sets, we can construct the "middle" level topics efficiently without modeling them explicitly. As a result, the hierarchical structure comes out straightforwardly and the whole learning process speeds up.

Fig. 1 shows a sub-tree of our learned topic tree on the JACM dataset (http://www.cs.princeton.edu/~blei/downloads/jacm.tgz), which contains 536 abstracts of the Journal of the ACM from 1987 to 2004. There are two super topics in this sub-tree, one related to systems and the other to databases. Looking into the database topic, we find that it is further divided into three specific aspects: "Scheme Design", "DB Robust" and "Transaction Control". We also observe that the super topics mainly contain widely used function words or stop words, resulting in the most "general" topic at the root.

The organization of the paper is as follows. In Section 2 we briefly introduce related work. In Section 3 we define our problem formulation and propose the HDP-C model. Our experiments on several real-world datasets are presented in Section 4, and we conclude in Section 5.

Figure 1: A sub-tree learned from the JACM dataset

2. RELATED WORK

Topic modeling as a task of document modeling has attracted much attention in recent years, but much of this work focuses on inferring hidden topics as a flat clustering over the term space [1]. One of the most basic and widely used models is Latent Dirichlet Allocation (LDA) [4]. LDA learns a predefined number of topics under a bag-of-words assumption. A question that comes with LDA is how many topics the model should estimate. To address this problem, another nonparametric Bayesian model, namely the Hierarchical Dirichlet Process (HDP), was introduced and adopted [13]. In both LDA and HDP, no relationship between topics is defined explicitly during learning, and the estimated topics are "flat".

Compared to such "flat" models, hierarchical modeling of topics can learn more accurate and predictive models, because it is more likely to capture the generative nature of text collections ([11], [9], [10]). The Correlated Topic Model (CTM) [3], an extension of LDA, captures relations between topics, but it only models pair-wise correlations. The Pachinko Allocation Model (PAM) [10] uses a directed acyclic graph (DAG) structure to learn and represent topic correlations. PAM connects words and topics on a DAG, where topics reside on the interior nodes and words reside on the leaf nodes, but PAM is unable to represent word distributions as parents of other word distributions. Zavitsanos et al. [16] learned a topic tree using LDA for each level, starting with one topic at level 0 and incrementing the number of topics in each further iteration/level, and used the symmetric KL divergence between neighboring hierarchies to indicate convergence. The basis of their work is quite similar to ours, but they learn a predefined number of topics for each level explicitly. Hierarchical Latent Dirichlet Allocation (hLDA) [2] is the first model to learn a tree-structured topic distribution. In hLDA, each document spans a single path from the root node to a leaf node of a tree with a predefined depth, and the words of that document are generated via the topics on that path. This model arranges the topics into a tree, with the desideratum that more general topics should appear near the root and more specialized topics near the leaves [8].

Hierarchical HDP (hHDP) [15], on the other hand, learns a topic hierarchy from text by defining an HDP for each level of the hierarchy. The hierarchy is learned in a bottom-up fashion: starting with a document corpus, the leaf topics are inferred first; then the word distributions of all leaf topics make up the observations for the estimation of the next level up. The procedure repeats until the root topic is inferred. In hHDP, the parent/child relationships between upper and lower topics are not clearly identified. Moreover, this recursive definition of HDP is likely to suffer from low time efficiency.

Our work is also built on HDP, but only for the root-level and lowest-level topics. We construct the interior-level topics with a simple clustering algorithm which is quite efficient, and an evaluation of the final tree quality also demonstrates its competitive performance. Different from traditional hierarchical clustering, which gives a hierarchical partition of documents [12], the points in our hierarchical clustering are word distributions.

Evaluation of the learned tree is also a relevant and interesting topic. In [14], an ontology evaluation method is proposed, and we adopt the same evaluation method for our work here due to its close relevance.
3. HDP-C MODEL

In this section, we first analyze the impact of the scale parameter of the Dirichlet distribution, then briefly introduce the HDP, and finally describe our clustering algorithm (HDP-C) in detail.

3.1 Dirichlet distribution and its scale parameter η

Our HDP-C model is built upon the HDP; the idea is to tune the topics' Dirichlet scale parameter so as to control the granularity of the topics used to model the text content. The Dirichlet distribution is a multi-parameter generalization of the Beta distribution, and it defines a distribution over distributions, i.e., samples from a Dirichlet are themselves distributions over some discrete probability space. The Dirichlet is in the exponential family and is the conjugate prior of the parameters of the multinomial distribution, which facilitates inference and parameter estimation.

Let θ be a k-dimensional Dirichlet random variable with θ_i ≥ 0 and \sum_{i=1}^{k} θ_i = 1. It lies in the (k−1)-dimensional probability simplex with the following probability density:

    p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}    (1)

where the parameter α is a k-vector with components α_i > 0, and α_i can be interpreted as the "prior observation count" for the events governed by θ_i. Furthermore, α_0 = \sum_i \alpha_i is called the scale or concentration parameter, with base measure (α_1/α_0, · · · , α_k/α_0), and Γ(x) is the Gamma function.

A frequently used special case is the symmetric Dirichlet distribution, where α_1 = · · · = α_k = η, indicating that we have no prior idea of which components are more favorable; as a result, we use a uniform base measure.

The scale parameter η plays an important role in controlling the variance and sparsity of the samples. For example, when η = 1, the symmetric Dirichlet distribution is equivalent to a uniform distribution over the (k−1)-dimensional probability simplex, i.e., it is uniform over all points in its support. Values of the scale parameter above 1 prefer dense, evenly distributed variates, i.e., all returned probabilities are similar to each other. Values of the scale parameter below 1 prefer sparse distributions, i.e., most of the returned probabilities are close to 0, and the vast majority of the mass is concentrated on a few components.

Fig. 2 depicts five samples for each of three η settings (η = 0.1, η = 1, η = 10) from a 10-dimensional Dirichlet distribution. Clearly, η = 0.1 yields samples that bias their probability mass toward a few components of the sampled multinomial distribution; η = 1 corresponds to a uniform distribution over the simplex; and η = 10 yields samples that are close to each other (in other words, each component gets a similar probability mass).

Figure 2: Samples of Dirichlet distributions with different scale parameter settings (x-axis: the components [1-10]; y-axis: the probability)

In a word, a smaller η encourages fewer words to carry high probability mass in each topic; thus the posterior requires more topics to explain the data, and we obtain relatively more specific topics. Based on this characteristic, we can obtain two topic sets with different granularity, corresponding to the up-bound and low-bound topic sets in the sense of granularity.
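To make the effect of η concrete, here is a minimal Python sketch (numpy only; the seed, the 90% mass threshold and the helper name are our own illustrative choices, not from the paper) that draws samples from a 10-dimensional symmetric Dirichlet for several η values and counts how many components are needed to cover most of the mass:

```python
import numpy as np

def components_for_mass(sample, mass=0.9):
    """Smallest number of components whose probabilities sum to `mass`."""
    sorted_probs = np.sort(sample)[::-1]                      # largest first
    return int(np.searchsorted(np.cumsum(sorted_probs), mass) + 1)

rng = np.random.default_rng(42)
k = 10
for eta in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(np.full(k, eta), size=5)
    counts = [components_for_mass(s) for s in samples]
    print(f"eta = {eta:>4}: components covering 90% of mass per sample -> {counts}")
```

With η = 0.1 the mass typically concentrates on one or two components, while with η = 10 nearly all ten components are needed, which is exactly the sparsity behavior exploited above.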
3.2 Hierarchical Dirichlet Processes

HDP is a nonparametric hierarchical Bayesian model which can automatically decide the number of topics. Fig. 3 shows the graphical model proposed in [13]. A global random probability measure G_0 is distributed as a Dirichlet process (DP) [5] with concentration parameter γ and base probability measure H. For each document d, a probability measure G_d is drawn from another Dirichlet process with concentration parameter α_0 and base probability measure G_0, where:

    G_0 \mid \gamma, H \sim DP(\gamma, H)    (2)

    G_d \mid \alpha_0, G_0 \sim DP(\alpha_0, G_0).    (3)

Figure 3: HDP graphical model

The Chinese restaurant franchise is a good metaphor for HDP. Assume there is a restaurant franchise holding a global menu of dishes shared across all restaurants. At each table of each restaurant, only one dish is served, selected from the global menu by the first customer who sits there, and it is shared among all customers who sit at the same table. The same dish can be served at multiple tables in multiple restaurants. In the document modeling scenario, each document corresponds to a restaurant, each word corresponds to a customer, and the topics are the dishes on the global shared menu. As a result, using HDP we can finally learn a set of global topics, and each document covers a subset of these topics.

For our particular application, the base measure H is the Dirichlet distribution over the term space, i.e., H ∼ Dirichlet(η). So the scale parameter η is used as the granularity indicator in our experiments.
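As a rough illustration of Equations (2) and (3) — not the inference procedure of [13] that the paper actually relies on — the following Python sketch draws a truncated stick-breaking approximation of G_0 ~ DP(γ, H) with H = Dirichlet(η) over the term space, and then a document-level G_d ~ DP(α_0, G_0) that reuses G_0's atoms; the truncation level K and vocabulary size V are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(concentration, truncation):
    """Truncated stick-breaking weights for a Dirichlet process draw."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    return weights / weights.sum()        # fold the truncated tail back in

V, K = 1000, 50                           # vocabulary size and truncation level (made up)
gamma, alpha0, eta = 1.0, 1.0, 0.125

# G0 ~ DP(gamma, H): global topic weights, with atoms (topics) drawn from H = Dirichlet(eta).
global_weights = stick_breaking(gamma, K)
topics = rng.dirichlet(np.full(V, eta), size=K)

# Gd ~ DP(alpha0, G0): a document-level measure whose atoms are drawn from G0,
# so the document reuses the global topics with its own weights.
doc_sticks = stick_breaking(alpha0, K)
doc_atom_ids = rng.choice(K, size=K, p=global_weights)
doc_mixture = np.zeros(K)
np.add.at(doc_mixture, doc_atom_ids, doc_sticks)   # merge sticks landing on the same topic

print("document uses", np.count_nonzero(doc_mixture > 1e-3), "of", K, "global topics")
```

The point of the sketch is only the sharing structure: every document's mixture is supported on the same set of global topics drawn once from H.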
3.3 Hierarchical clustering on topics

Based on the top-level and down-level topic sets, we use an agglomerative clustering algorithm to build up the interior nodes.

The top-level topics give a raw partition over the topic distributions and can be directly combined to form the root topic node. They can also help supervise the agglomerative clustering process for the low-level topics. So the whole algorithm is divided into three phases:

1. assign all low-level topics to their immediate top-level topics;
2. for each subset of low-level topics indexed under a top-level topic, invoke an agglomerative clustering process;
3. finally, define the root topic node as the combination of the top-level topics.

So the size of the final topic tree is determined by the numbers of topics on the top level and the down level, and we decide the depth of the tree afterward, according to user requirements, by truncating unwanted lower levels.

During the clustering process, a pair of "closest" topics is merged in each iteration. The whole procedure is presented as Algorithm 1. Algorithm 1 uses a "bottom-up" approach instead of a "top-down" approach, because we have no obvious way to split a topic distribution into two sub-topics, while merging two sub-topics into one is much more straightforward.

Algorithm 1: Hierarchical clustering algorithm (HDP-C) from the low-level topics to the top level
 1: learn a low-level topic set from HDP with η = 0.125: M = {tl_1, · · · , tl_|M|}
 2: learn a top-level topic set from HDP with η = 1.0: N = {tu_1, · · · , tu_|N|}
 3: for each tl_i ∈ M, find its "closest" topic tu_j ∈ N:
 4: for i = 1 to |M| do
 5:     tu_j = argmin_{tu_j ∈ N} D(tl_i, tu_j)
 6:     tu_j.childList.add(tl_i)
 7:     tu_j.nchild++
 8: end for
 9: cluster each top-level topic's children in an agglomerative hierarchical clustering style:
10: for i = 1 to |N| do
11:     while tu_i.nchild > 3 do
12:         find the closest pair of children (t_x, t_y)
13:         merge (t_x, t_y) into a new inner topic t_m
14:         tu_i.childList.remove(t_x)
15:         tu_i.childList.remove(t_y)
16:         tu_i.childList.add(t_m)
17:         tu_i.nchild−−
18:     end while
19: end for
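For concreteness, the sketch below re-implements Algorithm 1 in Python from the pseudocode. It is not the authors' code: topics are plain numpy word distributions, D is the symmetric KL divergence chosen in Section 4.4, the three-children stopping rule follows line 11, and the rule that a merged inner topic's distribution is the average of its two children is our own assumption, since the paper does not spell that detail out.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two word distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def hdp_c(low_topics, top_topics, max_children=3, dist=sym_kl):
    """Cluster low-level topics under their closest top-level topics (Algorithm 1 sketch).

    Topics are 1-D arrays over the term space. Returns, for each top-level topic index,
    a list of (tree, distribution) pairs, where `tree` is a nested tuple of low-level
    topic indices describing the merge order.
    """
    # Phase 1: assign every low-level topic to its closest top-level topic.
    children = {j: [] for j in range(len(top_topics))}
    for i, t in enumerate(low_topics):
        j = min(range(len(top_topics)), key=lambda jj: dist(t, top_topics[jj]))
        children[j].append((i, np.asarray(t, dtype=float)))

    # Phase 2: agglomeratively merge each top-level topic's children
    # until at most `max_children` remain (line 11 of the pseudocode).
    for j, nodes in children.items():
        while len(nodes) > max_children:
            pairs = [(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))]
            a, b = min(pairs, key=lambda ab: dist(nodes[ab[0]][1], nodes[ab[1]][1]))
            (ta, da), (tb, db) = nodes[a], nodes[b]
            merged = ((ta, tb), (da + db) / 2.0)   # new inner topic; assumed merge rule
            nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
        children[j] = nodes

    # Phase 3: the root node is simply the combination of all top-level topics.
    return children
```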
4. EXPERIMENT

In this section, we set up golden-line [14] topic trees from hierarchical data collections, test the tree quality with different probability metrics as the distance measure of our clustering, and use hLDA as the baseline to compare the tree quality in terms of Precision, Recall and F-score.

4.1 Data set

To evaluate the quality of the learned trees, we use the Wikipedia (WIKI) dataset from the third Pascal Challenge on Large Scale Hierarchical Text Classification (LSHTC3; http://lshtc.iit.demokritos.gr/LSHTC3 DATASETS) and the Open Directory Project (DMOZ) dataset from the second Pascal Challenge on Large Scale Hierarchical Text Classification (LSHTC2; http://lshtc.iit.demokritos.gr/LSHTC2 datasets). In total, we obtain three datasets from each of these two sources. All these datasets contain a hierarchy file defining the organization of the documents into a hierarchical tree structure; each document is assigned to one or more leaf nodes. The general statistics over these datasets and the JACM one are shown in Table 1. Given the hierarchical relationship, we randomly choose some sub-trees and build their corresponding golden-line topic trees according to the term frequencies from the documents' assignments.

Table 1: General statistics on the data sets
Dataset  rootId    #Docs  TermSpace  #Golden Topics
JACM     N/A       536    1809       N/A
wiki1    2020309   748    7032       36
wiki2    2337326   420    7047       46
wiki3    2319090   310    5516       29
dmoz1    38825     781    15082      68
dmoz2    39220     289    6290       16
dmoz3    38997     625    13907      87

4.2 Scale effect of η settings on HDP

The scale parameter η is used as the granularity indicator in our experiment. Fig. 4 shows how the number of topics learned by HDP on our datasets changes under different η settings. Fig. 5 shows the distribution variance of the topic collection learned by HDP on each dataset. For each η setting, the inner variance of the learned topic collection is measured by averaging the symmetric KL divergence between every topic and the centroid distribution of that collection. As shown, the variances drop almost consistently as η ranges from 0.1 to 1.0. This observation is consistent with the intended role of η.

Figure 4: Topic counts estimated by HDP vs. η settings ranging in [0.1, 1]

Figure 5: Average symmetric KL distance between learned topics under different Dirichlet scale parameter settings

4.3 Evaluation method

Given the learned topic tree and the golden-line topic tree, we want to measure how close these two structures are with a quantitative metric. We use the ontology evaluation method proposed in [14], which can capture, quite accurately, the deviations of the learned structure from the gold line by means of ontology alignment techniques. In this method, the ontology concepts are defined as vector space representations, which is the same as ours. We summarize the method as follows:

1. set up a one-to-one matching collection M = {1, · · · , |M|} based on the dissimilarity measure between nodes of the learned tree L = {tl_1, · · · , tl_|L|} and nodes of the golden tree G = {tg_1, · · · , tg_|G|}, where |M| = min(|G|, |L|);
2. for each matching m = (tl_i, tg_j), compute the Probabilistic Cotopy Precision (PCP_m) and the Probabilistic Cotopy Recall (PCR_m);
3. take a weighted average of the PCP and PCR values to compute P, R and the F-score, where the weight is the similarity between the nodes of the matching pair.

The corresponding formulas needed for steps 2 and 3 above are the following:

    PCP_m = \frac{|CS(tl_i) \cap CS(tg_j)|}{|CS(tl_i)|}    (4)

    PCR_m = \frac{|CS(tl_i) \cap CS(tg_j)|}{|CS(tg_j)|}    (5)

    TVD = \frac{1}{2} \sum_i |p(i) - q(i)|, \quad TVD \in [0, 1]    (6)

    P = \frac{1}{|M|} \sum_{m=1}^{|M|} (1 - TVD_m) \, PCP_m    (7)

    R = \frac{1}{|M|} \sum_{m=1}^{|M|} (1 - TVD_m) \, PCR_m    (8)

    F = \frac{P \cdot R}{P + R}    (9)

In the above equations, CS(t) is the Cotopy Set of node t, which includes all of its direct and indirect super- and sub-topics and itself. TVD is the Total Variational Distance [7]; it is used to measure the dissimilarity of two probability distributions.
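As a worked toy example of Equations (4)-(9) — with made-up node names, cotopy sets and distributions, not data from the paper's datasets — the following snippet computes PCP, PCR and the weighted P, R and F for a two-pair matching:

```python
import numpy as np

def pcp(cs_l, cs_g):
    return len(cs_l & cs_g) / len(cs_l)

def pcr(cs_l, cs_g):
    return len(cs_l & cs_g) / len(cs_g)

def tvd(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# A hypothetical matching of two learned nodes to two gold nodes.
# Each entry: (cotopy set of learned node, cotopy set of gold node, their word distributions).
matching = [
    ({"root", "db", "txn"}, {"root", "db", "txn", "schema"}, ([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])),
    ({"root", "sys"},       {"root", "sys"},                 ([0.6, 0.2, 0.2], [0.6, 0.3, 0.1])),
]

P = np.mean([(1 - tvd(p, q)) * pcp(cl, cg) for cl, cg, (p, q) in matching])
R = np.mean([(1 - tvd(p, q)) * pcr(cl, cg) for cl, cg, (p, q) in matching])
F = P * R / (P + R)
print(f"P = {P:.3f}, R = {R:.3f}, F = {F:.3f}")
```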
4.4 Probability metrics for the hierarchical clustering distance measure

Gibbs et al. [6] reviewed ten of the most popular probability metrics/distances used by statisticians and probabilists. We choose four of these and test their influence on the quality of our learned trees. The selected metrics are the symmetric KL divergence (D_skl), the Hellinger distance (D_h) and the symmetric χ² distance (D_sχ²), whose definitions are given below, as well as the TVD (D_tvd) defined in Equation (6).

    KL(p, q) = \sum_{i=1}^{V} p_i \log \frac{p_i}{q_i}

    D_{skl}(p, q) = \frac{1}{2} [KL(p, q) + KL(q, p)]    (10)

    D_h(p, q) = \Big( \sum_{i=1}^{V} (\sqrt{p_i} - \sqrt{q_i})^2 \Big)^{1/2}    (11)

    D_{\chi^2}(p, q) = \sum_{i=1}^{V} \frac{(p_i - q_i)^2}{q_i}

    D_{s\chi^2}(p, q) = \frac{1}{2} [D_{\chi^2}(p, q) + D_{\chi^2}(q, p)]    (12)

Fig. 6 plots the F-scores of our learned topic trees on all the datasets with the different distance measures. As observed from this figure, all choices perform similarly, so we choose the symmetric KL divergence as the distance measure of the clustering in the further experiments. The relationship between those measures can be found in [6].

Figure 6: F-scores of the learned trees with different probability metrics as the hierarchical clustering distance measure
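For reference, Equations (6) and (10)-(12) translate directly into code; the snippet below is a straightforward transcription, with a small epsilon guard against zero probabilities (a detail the paper does not discuss) and made-up example distributions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return np.sum(p * np.log(p / q))

def d_skl(p, q):                      # Equation (10)
    return 0.5 * (kl(p, q) + kl(q, p))

def d_hellinger(p, q):                # Equation (11)
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def chi2(p, q, eps=1e-12):
    p, q = np.asarray(p), np.asarray(q) + eps
    return np.sum((p - q) ** 2 / q)

def d_schi2(p, q):                    # Equation (12)
    return 0.5 * (chi2(p, q) + chi2(q, p))

def d_tvd(p, q):                      # Equation (6)
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
for name, d in [("sKL", d_skl), ("Hellinger", d_hellinger), ("sChi2", d_schi2), ("TVD", d_tvd)]:
    print(f"{name:9s} {d(p, q):.4f}")
```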
4.5 Performance of HDP-C

We use hLDA as the baseline, learning a 4-depth topic tree with the scale parameter settings η = 1.0, η = 0.5, η = 0.25 and η = 0.125 for the four levels. This is then compared to the top 4-depth sub-tree of the tree learned by HDP-C. To be consistent with hLDA's η settings, the top-level topics are learned with η = 1.0 and the down-level topics with η = 0.125. We use the default values for the other parameters: γ = 1.0, α_0 = 1.0, and a maximum of 1000 iterations. For hLDA, we set the maximum number of iterations to 2000 because it has a bigger learning space than HDP.

Table 2: Comparison of tree quality with the baseline
                 hLDA                    HDP-C                 Enhancement (%)
           P      R      F         P      R      F          P      R      F
wiki1    0.323  0.459  0.189     0.346  0.464  0.198       7.1    1.1    4.8
wiki2    0.333  0.461  0.193     0.401  0.460  0.214      20.4   -0.2   10.9
wiki3    0.335  0.532  0.205     0.432  0.574  0.247      29.0    7.9   20.5
dmoz1    0.350  0.449  0.197     0.437  0.432  0.217      24.9   -3.8   10.2
dmoz2    0.234  0.472  0.156     0.389  0.497  0.218      66.2    5.3   39.7
dmoz3    0.302  0.364  0.165     0.445  0.392  0.208      47.4    7.7   26.1

The evaluation results are given in Table 2 (note that the JACM dataset is not included here due to the lack of a golden-line topic tree). In terms of the F-score, our approach performs, on average, 12.1% better on the wiki datasets and 25.3% better on the dmoz datasets. One reason is that in hLDA each document only spans a single path from the root to a leaf node, which is quite a tough restriction on the mixture of topics for each document. In contrast, our approach does not place any prior restriction on each document's topic choice; each document can span an arbitrary sub-tree, which better reflects its generative nature.

Besides, we observe from Table 2 that the improvement in terms of P is much larger than that in terms of R, which indicates that our approach is more preferable for tasks that care more about precision.

5. CONCLUSIONS

This paper builds a semantic topic tree representation for a document collection based on a non-parametric Bayesian topic model. Only the up-bound and low-bound topic sets are directly inferred, by tuning the topics' Dirichlet scale parameter for different levels of abstraction. A hierarchical clustering algorithm (HDP-C) is proposed to derive the middle-level topics in order to construct the final topic tree. Our experimental study on several real-world datasets shows the competitive performance of our approach.

6. ACKNOWLEDGMENTS

The work described in this paper has been supported by the NSFC Overseas, Hong Kong & Macao Scholars Collaborated Researching Fund (61028003) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (20090141120050).

7. REFERENCES

[1] D. M. Blei. Introduction to probabilistic topic models. Communications of the ACM, 2011.
[2] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57:7:1–7:30, February 2010.
[3] D. M. Blei and J. D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.
[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
[6] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
[7] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
[8] T. Hofmann. The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Volume 2, pages 682–687, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[9] W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In UAI 2007.
[10] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, pages 577–584, 2006.
[11] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pages 633–640, 2007.
[12] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.
[13] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[14] E. Zavitsanos, G. Paliouras, and G. Vouros. Gold standard evaluation of ontology learning methods through ontology transformation and alignment. IEEE Transactions on Knowledge and Data Engineering, 23(11):1635–1648, November 2011.
[15] E. Zavitsanos, G. Paliouras, and G. A. Vouros. Non-parametric estimation of topic hierarchies from texts with hierarchical Dirichlet processes. Journal of Machine Learning Research, 12:2749–2775, November 2011.
[16] E. Zavitsanos, S. Petridis, G. Paliouras, and G. A. Vouros. Determining automatically the size of learned ontologies. In Proceedings of ECAI 2008: 18th European Conference on Artificial Intelligence, pages 775–776, Amsterdam, The Netherlands, 2008. IOS Press.