
Infinite Coauthor Topic Model (Infinite coAT): A Non-Parametric Generalization for coAT model

Han Zhang (2), zhanghan2012@istic.ac.cn

Shuo Xu (3), xush@istic.ac.cn

Xiaodong Qiao (2), qiaox@istic.ac.cn

Zhaofeng Zhang (2), zhangzf@istic.ac.cn

Hongqi Han (2), hanhq@istic.ac.cn

General Terms: Algorithms, Performance

(1) Corresponding author

(2) Information Technology Support Center, Institute of Scientific and Technical Information of China (ISTIC), No.15 Fuxing Rd., Haidian District, Beijing 100038, P.R. China

(3) Institute of Scientific and Technical Information of China (ISTIC), No.15 Fuxing Rd., Haidian District, Beijing 100038, P.R. China


Inspired by the hierarchical Dirichlet process (HDP), we present a generalized coAT (coauthor topic) model, also called the infinite coAT model, in this paper. The infinite coAT model is a nonparametric extension of the coAT model. It can automatically determine the number of topics, each of which is regarded as a probability distribution over words, so one does not need to provide prior information about the number of topics. To keep consistency with the coAT model, Gibbs sampling is used to infer the parameters. Finally, experimental results on a US patent dataset from the US Patent Office indicate that our infinite coAT model is feasible and efficient.

Keywords: coauthor topic (coAT) model; infinite coauthor topic (infinite coAT) model; stick-breaking prior; hierarchical Dirichlet processes; collapsed Gibbs sampling

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]

Copying permitted for private and academic purposes.

This volume is published and copyrighted by its editors.

Published at Ceur-ws.org Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014.

1. INTRODUCTION

A social network is a social structure made up of a set of social actors (such as individuals or organizations) and a set of dyadic ties between these actors [1] [2]. It can model various social relationships among people, such as shared interests, activities, backgrounds or real-life connections. Social network analysis is therefore very useful for measuring social characteristics and structure [2-6]. However, most existing methods of social network analysis consider only the links between actors and ignore the attributes of those links, which may lead to serious problems, for example, mistaking obviously wrong links for correct ones merely because of the number of collaborations between authors [7]. Hence, some methods that consider both links and their attributes have been proposed [8-11], including our previous work, the coauthor topic (coAT) model, which can identify actors with similar interests from social networks.

In the coAT model, however, users have to specify the number of topics ahead of time. In practice, users do not know the exact number of topics and can only guess an approximation. How to choose the number of topics is therefore a frequently raised question. Inspired by hierarchical Dirichlet processes (HDP) [12] [13], in this article we introduce a stick-breaking prior into the coAT model to obtain an infinite coAT model. The infinite coAT model can thus not only discover the shared interests between authors but also infer an adequate number of topics automatically.

The rest of this paper is organized as follows. Section 2 briefly introduces the coAT model and its inference. Section 3 proposes the non-parametric coAT model and uses Gibbs sampling to infer its parameters. Section 4 reports experimental evaluations on US patents, and Section 5 concludes this work.

Notations. For convenience, the notations are summarized in Table 1.

2. Coauthor Topic (coAT) model

In this section, we briefly introduce the coAT model with a fixed number of topics. The graphical model representation of the coAT model is shown in Fig. 1 a).

The coAT model [11] can be viewed as the following generative process:

(1) for each topic $k \in [1, K]$:
  (i) draw a multinomial $\varphi_k$ from Dirichlet($\beta$);
(2) for each author pair $(i, j)$ with $i \in [1, A-1]$, $j \in [i+1, A]$:
  (i) draw a multinomial $\vartheta_{i,j}$ from Dirichlet($\alpha$);
(3) for each word $n \in [1, N_m]$ in document $m \in [1, M]$:
  (i) draw an author $x_{m,n}$ uniformly from the group of authors $\mathbf{a}_m$;
  (ii) draw another author $y_{m,n}$ uniformly from the group of authors $\mathbf{a}_m \setminus x_{m,n}$;
  (iii) if $x_{m,n} > y_{m,n}$, swap $x_{m,n}$ with $y_{m,n}$;
  (iv) draw a topic assignment $z_{m,n}$ from multinomial($\vartheta_{x_{m,n}, y_{m,n}}$);
  (v) draw a word $w_{m,n}$ from multinomial($\varphi_{z_{m,n}}$).
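As an illustration, the generative process above can be simulated with a short Python sketch. It is not the authors' implementation: the function name `generate_coat_corpus`, the zero-based author ids, and the symmetric scalar hyper-parameters `alpha` and `beta` are assumptions made for this example, and the per-topic word distributions are assumed to be drawn from Dirichlet($\beta$) as in LDA-style models.

```python
import numpy as np


def generate_coat_corpus(doc_authors, doc_lengths, K, V, alpha, beta, seed=0):
    """Sample a toy corpus from the coAT generative process (illustrative only).

    doc_authors: list of author-id lists, one per document (>= 2 authors each).
    doc_lengths: list of token counts N_m, one per document.
    alpha, beta: symmetric Dirichlet hyper-parameters (scalars, an assumption).
    """
    rng = np.random.default_rng(seed)
    A = 1 + max(a for authors in doc_authors for a in authors)

    # One word distribution per topic, shape (K, V).
    phi = rng.dirichlet(np.full(V, beta), size=K)

    # One topic distribution per author pair (i, j) with i < j.
    theta = {(i, j): rng.dirichlet(np.full(K, alpha))
             for i in range(A) for j in range(i + 1, A)}

    corpus = []
    for authors, n_tokens in zip(doc_authors, doc_lengths):
        tokens = []
        for _ in range(n_tokens):
            # Two distinct authors drawn uniformly from the document's authors.
            x, y = (int(a) for a in rng.choice(authors, size=2, replace=False))
            # Order the pair so that x < y.
            if x > y:
                x, y = y, x
            z = int(rng.choice(K, p=theta[(x, y)]))  # topic assignment
            w = int(rng.choice(V, p=phi[z]))         # word
            tokens.append((x, y, z, w))
        corpus.append(tokens)
    return phi, theta, corpus
```

Each token is returned as a tuple `(x, y, z, w)`, which makes it easy to check that the ordered pair always satisfies `x < y`, matching the swap step.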

Based on the generative process above, the coAT model has two sets of unknown parameters: (1) $\Phi = \{\varphi_k\}_{k=1}^{K}$ and $\Theta = \{\{\vartheta_{i,j}\}_{i=1}^{A-1}\}_{j=i+1}^{A}$; (2) the corresponding topic and author pair assignments $z_{m,n}$ and $(x_{m,n}, y_{m,n})$ for each word token $w_{m,n}$. The full conditional probability is as follows [11]:

$$P(z_{m,n}=k, x_{m,n}=i, y_{m,n}=j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \beta) \propto \frac{n_{i,j}^{(k)} + \alpha_k - 1}{\sum_{k=1}^{K}\left(n_{i,j}^{(k)} + \alpha_k\right) - 1} \times \frac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V}\left(n_k^{(v)} + \beta_v\right) - 1} \tag{1}$$

where $n_k^{(v)}$ is the number of times tokens of word $v$ are assigned to topic $k$, and $n_{i,j}^{(k)}$ is the number of times author pair $(i, j)$ is assigned to topic $k$. The parameters are then estimated from their definitions and Bayes' rule as follows [11]:

$$\varphi_{k,v} = \frac{n_k^{(v)} + \beta_v}{\sum_{v=1}^{V}\left(n_k^{(v)} + \beta_v\right)} \tag{2}$$

$$\vartheta_{i,j,k} = \frac{n_{i,j}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{i,j}^{(k)} + \alpha_k\right)} \tag{3}$$

3. Infinite Coauthor Topic (infinite coAT) model: nonparametric coAT model

How to choose the number of topics in the coAT model is always a troublesome question. The hierarchical Dirichlet process (HDP) [12] [13] provides a non-parametric method to solve this problem. The method places a prior over a countably infinite number of topics, of which only a few will dominate the posterior. Inspired by this method, we propose an infinite coAT model. The model splits the Dirichlet hyper-parameter α into a scalar precision α and a base distribution τ ~ Dir(γ/K) [13]. Taking the limit K→+∞, we obtain the root distribution for the nonparametric coAT model. In this way, we can retain the structure of the parametric case for the Gibbs update of parameters:

$$P(z_{m,n}=k, x_{m,n}=i, y_{m,n}=j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \tau, \beta) \propto
\begin{cases}
\dfrac{n_{i,j}^{(k)} + \alpha\tau_k - 1}{\sum_{k=1}^{K} n_{i,j}^{(k)} + \alpha - 1} \times \dfrac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V}\left(n_k^{(v)} + \beta_v\right) - 1}, & \text{if } z = k \text{ is an existing topic} \\[2ex]
\dfrac{\alpha\tau_{K+1}}{\sum_{k=1}^{K} n_{i,j}^{(k)} + \alpha - 1} \times \dfrac{1}{V}, & \text{if } z = k \text{ is a new topic}
\end{cases} \tag{4}$$

Note that the sampling space has K+1 dimensions because the root distribution τ provides K+1 possible states. The term ατ_{K+1}/V represents all unused topics: if it is sampled, a new topic is created. In this way, no prior information about the number of topics is needed, and the model determines it automatically.
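The (K+1)-dimensional sampling space can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `topic_weights`, the symmetric scalar `beta`, and the convention that the current token has already been removed from all counts (so the "−1" terms of Eq. (4) are absorbed into the counts) are choices made for this example.

```python
import numpy as np


def topic_weights(n_pair, n_topic_word, n_topic, alpha, tau, beta, V):
    """Unnormalized weights over K existing topics plus one new topic.

    All counts are for one fixed author pair (i, j) and one word v, with the
    current token already excluded (an assumption of this sketch).

    n_pair:       shape (K,), tokens of pair (i, j) assigned to each topic
    n_topic_word: shape (K,), tokens of word v assigned to each topic
    n_topic:      shape (K,), total tokens assigned to each topic
    tau:          shape (K + 1,), root distribution; last entry is the
                  mass reserved for unused topics
    beta:         symmetric scalar word prior (assumption)
    """
    K = len(n_pair)
    pair_norm = n_pair.sum() + alpha
    weights = np.empty(K + 1)
    # Existing topics: pair-topic factor times word-topic factor.
    weights[:K] = ((n_pair + alpha * tau[:K]) / pair_norm
                   * (n_topic_word + beta) / (n_topic + V * beta))
    # New topic: unused mass alpha * tau_{K+1}, with word probability 1 / V.
    weights[K] = alpha * tau[K] / pair_norm / V
    return weights


def sample_topic(rng, weights):
    """Draw an index in [0, K]; index K means 'create a new topic'."""
    return int(rng.choice(len(weights), p=weights / weights.sum()))
```

If `sample_topic` returns `K`, the caller would append a new topic and extend its count arrays; how the counts are stored is left open here.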

According to the inference above, the root distribution τ plays a central role in the non-parametric model, so how to sample τ is a crucial problem. In this paper, τ is sampled by simulating how new components are created, which yields a sequence of Bernoulli trials [13]:

$$p(m_{ijkr} = 1) = \frac{\alpha\tau_k}{\alpha\tau_k + r - 1}, \quad r \in [1, n_{i,j}^{(k)}],\; i \in [1, A-1],\; j \in [i+1, A],\; k \in [1, K] \tag{5}$$

The posterior of the top-level Dirichlet process τ is then sampled via [13]

$$\tau \sim \text{Dirichlet}([m_1, \ldots, m_K], \gamma) \tag{6}$$

with $m_k = \sum_{i,j,r} m_{ijkr}$.
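A minimal Python sketch of this two-step update is given below. It is an illustration rather than the authors' implementation: the function name `sample_tau`, the (P, K) count matrix with one row per author pair, and the assumption that every topic holds at least one token and that all entries of `tau` are positive are choices made for this example.

```python
import numpy as np


def sample_tau(pair_topic_counts, tau, alpha, gamma, rng):
    """Resample the root distribution tau from Bernoulli trials.

    pair_topic_counts: shape (P, K), n_{i,j}^{(k)} for P author pairs.
    tau:               shape (K + 1,), current root distribution (all > 0).
    Assumes each topic column contains at least one token, so every m_k >= 1.
    Returns (new_tau, m) where new_tau has shape (K + 1,).
    """
    _, K = pair_topic_counts.shape
    m = np.zeros(K)
    for k in range(K):
        a = alpha * tau[k]
        for n in pair_topic_counts[:, k]:
            r = np.arange(1, int(n) + 1)
            # Trial r succeeds with probability a / (a + r - 1), as in Eq. (5);
            # the first trial (r = 1) always succeeds.
            m[k] += np.sum(rng.random(r.size) < a / (a + r - 1))
    # Dirichlet over the K used topics plus the unused mass gamma, as in Eq. (6).
    new_tau = rng.dirichlet(np.append(m, gamma))
    return new_tau, m
```

Because the first trial for any non-empty cell always succeeds, each `m[k]` is at least the number of author pairs that use topic `k` and at most the total token count of that topic.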

4. Experimental results and discussions

We downloaded US patents from the US Patent Office¹ on Jun 25, 2014 with the following search strategy: ICL/F02M069/48 or TTL/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or ABST/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or ACLM/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or SPEC/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde). The dataset contains 4760 patent abstracts and 7540 unique inventors and is used to evaluate the performance of our model.

In our experiment, the infinite coAT model automatically determines the number of topics, which is 20. Since each topic is a probability distribution over words, Table 2 lists 5 of these topics, giving for each the ten most probable words with their probabilities and the ten co-inventor relationships with the highest probability conditioned on that topic. The meaning of these topics is easy to summarize. For example, topic 1 is clearly about “engine”, topic 4 is about “material”, and so on.

¹ http://patft.uspto.gov/netahtml/PTO/search-adv.htm

We take David Karl Bidner and Ralph Wayne Cunningham as an example and list the titles of their co-invented patents in Table 3. Table 3 shows that their co-invented patents are all about the engine, which is the meaning of topic 1. Comparing Table 3 with Table 2, David Karl Bidner and Ralph Wayne Cunningham share Topic 1 as an interest with a strength of 0.96833, which is consistent with their co-invented patents all being about topic 1.

In addition, to compare the performance of the coAT and infinite coAT models, we use perplexity, a standard measure for evaluating probabilistic models. The smaller the perplexity is, the better the model performs. The perplexity is defined as the reciprocal geometric mean of the token likelihoods in the test set D = {w_m, a_m} under the coAT or infinite coAT model:

$$\text{perplexity}_{\text{coAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B) = \exp\left[-\frac{\ln P_{\text{coAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B)}{N_m \times \frac{A_m(A_m-1)}{2}}\right] \tag{7}$$

$$\text{perplexity}_{\text{icoAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B) = \exp\left[-\frac{\ln P_{\text{icoAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B)}{N_m \times \frac{A_m(A_m-1)}{2}}\right] \tag{8}$$

where B is the set of all the prior parameters.
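The perplexity computation can be sketched in Python as below. This is a hedged illustration, not the authors' evaluation code: the function names are hypothetical, the normaliser is read from Eqs. (7) and (8) as the product of the token count N_m and the number of author pairs A_m(A_m−1)/2, and the document likelihood is assumed to be a uniform mixture over the document's author pairs, which follows from the uniform author draws of the generative process.

```python
import itertools
import math

import numpy as np


def doc_log_likelihood(words, authors, theta, phi):
    """ln P(w_m | a_m) under point estimates of theta and phi.

    words:   list of word ids in the document
    authors: list of author ids (at least two)
    theta:   dict mapping an ordered pair (i, j), i < j, to a (K,) array
    phi:     (K, V) array of topic-word probabilities
    Assumes each token comes from a uniform mixture over author pairs.
    """
    pairs = list(itertools.combinations(sorted(authors), 2))
    # Probability of every vocabulary word, averaged over author pairs.
    word_probs = np.mean([theta[p] @ phi for p in pairs], axis=0)
    return float(np.sum(np.log(word_probs[words])))


def perplexity(log_likelihood, n_tokens, n_authors):
    """exp(-ln P / (N_m * A_m (A_m - 1) / 2)), as read from Eqs. (7)-(8)."""
    n_pairs = n_authors * (n_authors - 1) / 2
    return math.exp(-log_likelihood / (n_tokens * n_pairs))
```

The same two functions apply to both models; only the estimates `theta` and `phi` (and the number of topics K) differ between coAT and infinite coAT.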

Fig. 2 shows the results of the coAT and infinite coAT models. The perplexity of the coAT model increases as the number of topics increases, whereas the perplexity of the infinite coAT model stays stable at the value obtained with its determined number of topics, 20. When the number of topics in the coAT model is greater than 45, the perplexity of the coAT model is larger than that of the infinite coAT model. In the coAT model, however, we do not know in advance what number of topics to choose, and in practice we tend to prefer a larger number such as 100. Hence, without information about the exact number of topics, the infinite coAT model outperforms the coAT model.

5. Conclusions

In this paper, we generalize the coAT model to a nonparametric counterpart, the infinite coAT model, which can estimate the number of topics. The model can thus not only discover the shared interests between inventors but also determine the number of topics automatically. The experiments on US patents illustrate that the infinite coAT model is feasible.

In ongoing work, we can consider an infinite coAT model over time to discover dynamic shared interests among authors, or use this nonparametric method in other extended LDA models, such as the AToT model [14][15], to mine more useful information.

6. ACKNOWLEDGMENTS

This work is partially funded by the Natural Science Foundation of China: Research on Technology Opportunity Detection based on Paper and Patent Information Resources under grant number 71403255 and Study on the Disconnected Problem of Scientific Collaboration Network under grant number 71473237; Key Technologies R&D Program of Chinese 12th Five-Year Plan (2011–2015): Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources under grant number 2013BAG06B01; and Key Work Project of Institute of Scientific and Technical Information of China (ISTIC): Intelligent Analysis Service Platform and Application Demonstration for Multi-Source Science and Technology Literature in the Era of Big Data under grant number ZD2014-7-1. Our gratitude also goes to the anonymous reviewers for their valuable comments.

7. REFERENCES

[1] C. C. Aggarwal. Social network data analytics. Springer US, 2011.

[2] M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 2001, 64(1): 016131.

[3] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 2001, 64(1): 016132.

[4] A. Abbasi. Exploring the Relationship between Research Impact and Collaborations for Information Science. In Proceedings of the 45th Hawaii International Conference on Systems Science (HICSS-45), Hawaii, USA, 2012.

[5] Z. Zhang, Q. Li, D. Zeng, et al. User community discovery from multi-relational networks. Decision Support Systems, 2013, 54(2): 870-879.

[6] Han, Xu, Gui, Qiao, Zhu, Zhang. Uncovering Research Topics of Academic Communities of Scientific Collaboration Network. International Journal of Distributed Sensor Networks, 2014, 4, 529842, 1-14.

[7] C. Wang, J. Han, Y. Jia, et al. Mining advisor-advisee relationships from research publication networks. KDD'10, 2010.

[8] B. Taskar, P. Abbeel, D. Koller. Discriminative probabilistic models for relational data. Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002: 485-492.

[9] L. E. Sucar. Probabilistic Graphical Models and Their Applications in Intelligent Environments. In Intelligent Environments (IE), 2012 8th International Conference on, 2012: 11-15.

[10] Larrañaga, Karshenas, Bielza, et al. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics, 2012: 1-25.

[11] An, Xu, Wen, et al. A Shared Interest Discovery Model for Coauthor Relationship in SNS. International Journal of Distributed Sensor Networks, 2014, 2014.

[12] Y. W. Teh, M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006, 101(476).

[13] Heinrich. Infinite LDA implementing the HDP with minimum code complexity. Technical note, Feb, 170, 2011.

[14] Xu, Shi, Qiao, et al. Author-Topic over Time (AToT): A Dynamic Users' Interest Model. Mobile, Ubiquitous, and Intelligent Computing. Springer Berlin Heidelberg, 2014: 239-245.

[15] Xu, Shi, Qiao, et al. A dynamic users' interest discovery model with distributed inference algorithm. International Journal of Distributed Sensor Networks, 2014, 2014.