Infinite Coauthor Topic Model (Infinite coAT): A Non-Parametric Generalization of the coAT Model

Han Zhang (zhanghan2012@istic.ac.cn), Shuo Xu (corresponding author, xush@istic.ac.cn), Xiaodong Qiao (qiaox@istic.ac.cn), Zhaofeng Zhang (zhangzf@istic.ac.cn), Hongqi Han (hanhq@istic.ac.cn)
Information Technology Support Center, Institute of Scientific and Technical Information of China (ISTIC)
No. 15 Fuxing Rd., Haidian District, Beijing 100038, P.R. China

ABSTRACT
Inspired by the hierarchical Dirichlet process (HDP), we present a generalized coauthor topic (coAT) model, also called the infinite coAT model. The infinite coAT model is a non-parametric extension of the coAT model: it automatically determines the number of topics, each topic being a probability distribution over words, so no prior information about the number of topics needs to be provided. To keep consistency with the coAT model, Gibbs sampling is utilized to infer the parameters. Experimental results on a patent dataset from the US Patent Office indicate that our infinite coAT model is feasible and efficient.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

General Terms
Algorithms, Performance

Keywords
Coauthor topic (coAT) model, infinite coauthor topic (infinite coAT) model, stick-breaking prior, hierarchical Dirichlet processes, collapsed Gibbs sampling.

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors and published at CEUR-WS.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014, at KONVENS'14, October 8-10, 2014, Hildesheim, Germany.

1. INTRODUCTION
A social network is a social structure made up of a set of social actors (such as individuals or organizations) and a set of dyadic ties between these actors [1][2]. It can model various social relationships among people, such as shared interests, activities, backgrounds, or real-life connections, and social network analysis is therefore very useful for measuring social characteristics and structure [2-6]. However, most existing methods of social network analysis consider only the links between actors and ignore the attributes of those links, which can lead to serious problems, for example, accepting obviously wrong links as correct ones merely on the basis of the number of collaborations between authors [7]. Hence, methods that consider both links and their attributes have been proposed [8-11], including our previous work, the coauthor topic (coAT) model, which can identify actors with similar interests in social networks.

In the coAT model, however, users have to supply the number of topics ahead of time. In practice, users do not know the exact number of topics and can only guess an approximation, so how to choose the number of topics is a frequently raised question. Inspired by hierarchical Dirichlet processes (HDP) [12][13], in this article we introduce a stick-breaking prior into the coAT model to obtain an infinite coAT model.
Thus, the infinite coAT model can not only discover the shared interests between authors but also infer an adequate number of topics automatically.

The rest of this paper is organized as follows. Section 2 briefly introduces the coAT model and its inference. Section 3 proposes the non-parametric coAT model and derives a Gibbs sampling method for inferring its parameters. Section 4 reports experimental evaluations on US patents, and Section 5 concludes this work.

Notations. For the convenience of description, we summarize the notations in Table 1.

Table 1. Notation used in the models

SYMBOL      DESCRIPTION
K           Number of topics
M           Number of documents
V           Number of unique words
A           Number of unique authors
N_m         Number of word tokens in document m
A_m         Number of authors in document m
a_m         Authors in document m
φ_k         Multinomial distribution over words specific to topic k
ϑ_{i,j}     Multinomial distribution over topics specific to the coauthor relationship (i, j)
z_{m,n}     Topic assignment associated with the nth token in document m
w_{m,n}     The nth token in document m
x_{m,n}     One chosen author associated with the word token w_{m,n}
y_{m,n}     The other chosen author associated with the word token w_{m,n}
α           Dirichlet prior (hyperparameter) on the multinomial distribution ϑ in the coAT model; scalar precision on ϑ in the infinite coAT model
β           Dirichlet prior (hyperparameter) on the multinomial distribution φ
τ           Root distribution of the hierarchical Dirichlet process in the infinite coAT model
γ           Dirichlet prior on the root distribution τ

2. Coauthor Topic (coAT) model
In this section, we briefly introduce the coAT model with a fixed number of topics; its graphical model representation is shown in Fig. 1 a).

The coAT model [11] can be viewed as the following generative process:
(1) For each topic k ∈ [1, K]:
    (i) draw a multinomial φ_k from Dirichlet(β);
(2) for each author pair (i, j) with i ∈ [1, A-1], j ∈ [i+1, A]:
    (i) draw a multinomial ϑ_{i,j} from Dirichlet(α);
(3) for each word n ∈ [1, N_m] in document m ∈ [1, M]:
    (i) draw an author x_{m,n} uniformly from the group of authors a_m;
    (ii) draw another author y_{m,n} uniformly from the group of authors a_m \ x_{m,n};
    (iii) if x_{m,n} > y_{m,n}, swap x_{m,n} and y_{m,n};
    (iv) draw a topic assignment z_{m,n} from Multinomial(ϑ_{x_{m,n}, y_{m,n}});
    (v) draw a word w_{m,n} from Multinomial(φ_{z_{m,n}}).

Based on the generative process above, the coAT model has two sets of unknown parameters: (1) Φ = {φ_k}_{k=1}^{K} and Θ = {{ϑ_{i,j}}_{i=1}^{A-1}}_{j=i+1}^{A}; (2) the corresponding topic and author-pair assignments z_{m,n} and (x_{m,n}, y_{m,n}) for each word token w_{m,n}. The full conditional probability is as follows [11]:

P(z_{m,n}=k, x_{m,n}=i, y_{m,n}=j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \frac{n_{i,j}^{(k)} + \alpha_k - 1}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha_k) - 1} \cdot \frac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v) - 1}    (1)

where n_k^{(v)} is the number of times tokens of word v are assigned to topic k, and n_{i,j}^{(k)} is the number of times author pair (i, j) is assigned to topic k. From these counts, the parameter estimates follow from their definitions and Bayes' rule [11]:

\varphi_{k,v} = \frac{n_k^{(v)} + \beta_v}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v)}    (2)

\vartheta_{i,j,k} = \frac{n_{i,j}^{(k)} + \alpha_k}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha_k)}    (3)
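The paper does not include an implementation, so the following is a minimal Python sketch of one sweep of the collapsed Gibbs sampler of Eq. (1), written under simplifying assumptions: symmetric scalar priors alpha and beta, dense NumPy count arrays, and hypothetical names such as n_pair_topic and n_topic_word. The "-1" terms of Eq. (1) correspond to excluding the current token, which the code realizes by decrementing its counts before sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, doc_authors, K, V, alpha, beta,
                z, x, y, n_pair_topic, n_topic_word, n_topic):
    """One collapsed Gibbs sweep for the coAT model (Eq. 1).

    docs[m]: list of word ids; doc_authors[m]: list of author ids.
    n_pair_topic: A x A x K pair-topic counts (only i < j entries used);
    n_topic_word: K x V topic-word counts; n_topic: K per-topic totals.
    z, x, y: per-token assignments, stored as lists of lists.
    """
    for m, words in enumerate(docs):
        authors = doc_authors[m]
        for n, v in enumerate(words):
            i, j, k = x[m][n], y[m][n], z[m][n]
            # Exclude the current token from all counts (the "-1" in Eq. 1).
            n_pair_topic[i, j, k] -= 1
            n_topic_word[k, v] -= 1
            n_topic[k] -= 1
            # Score every candidate (author pair, topic) jointly.
            cands, probs = [], []
            for a in range(len(authors)):
                for b in range(a + 1, len(authors)):
                    i2, j2 = sorted((authors[a], authors[b]))
                    pair_total = n_pair_topic[i2, j2].sum() + K * alpha
                    for k2 in range(K):
                        p = ((n_pair_topic[i2, j2, k2] + alpha) / pair_total
                             * (n_topic_word[k2, v] + beta)
                             / (n_topic[k2] + V * beta))
                        cands.append((i2, j2, k2))
                        probs.append(p)
            probs = np.asarray(probs)
            i, j, k = cands[rng.choice(len(cands), p=probs / probs.sum())]
            # Record the new assignment and restore the counts.
            x[m][n], y[m][n], z[m][n] = i, j, k
            n_pair_topic[i, j, k] += 1
            n_topic_word[k, v] += 1
            n_topic[k] += 1
```

After a burn-in of such sweeps, Eqs. (2) and (3) turn the accumulated counts into point estimates, e.g. phi = (n_topic_word + beta) / (n_topic[:, None] + V * beta).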
[Figure 1: two plate diagrams.]
Fig. 1. Admixture models for documents and coauthor relationships: a) the coAT model; b) the non-parametric coAT model, i.e., the infinite coAT model. The plates range over i ∈ [1, A-1], j ∈ [i+1, A], n ∈ [1, N_m], and m ∈ [1, M]; the topic plate is k ∈ [1, K] in the coAT model and k ∈ [1, ∞) in the infinite coAT model.

3. Infinite Coauthor Topic (infinite coAT) model: the nonparametric coAT model
How to choose the number of topics in the coAT model is always a troublesome question. The hierarchical Dirichlet process (HDP) [12][13] provides a non-parametric way to solve this problem: it places a prior over a countably infinite number of topics, of which only a few will dominate the posterior. Inspired by this method, we propose the infinite coAT model shown in Fig. 1 b). Starting from the parametric coAT model, the infinite coAT model splits the Dirichlet hyperparameter α into a scalar precision α and a base distribution τ ~ Dir(γ/K) [13]. Taking this to the limit K → +∞, we obtain the root distribution of the non-parametric coAT model. In this way, we retain the structure of the parametric case for the Gibbs update of parameters:

P(z_{m,n}=k, x_{m,n}=i, y_{m,n}=j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \boldsymbol{\beta}, \boldsymbol{\tau}) \propto
\begin{cases}
\dfrac{n_{i,j}^{(k)} + \alpha\tau_k - 1}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha\tau_k) - 1} \cdot \dfrac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v) - 1}, & \text{if } z = k, \\
\dfrac{\alpha\tau_{K+1}}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha\tau_k) - 1} \cdot \dfrac{1}{V}, & \text{if } z = k_{\mathrm{new}}.
\end{cases}    (4)

Note that the sampling space has K+1 dimensions because the root distribution τ provides K+1 possible states: ατ_{K+1}/V represents all unused topics, and whenever ατ_{K+1}/V is sampled, a new topic is created. In this way, no information about the number of topics needs to be supplied, and the model outputs it automatically.

From the inference above, the importance of the root distribution τ in the non-parametric model becomes obvious, and how to sample τ is naturally a crucial problem.
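To make the (K+1)-dimensional update of Eq. (4) concrete, here is a hypothetical Python sketch of sampling a topic for one token with the author pair (i, j) held fixed, together with the bookkeeping for creating a new topic. It assumes τ is stored as a length-(K+1) vector whose last entry is the unused mass, pair-topic counts kept in a dict of length-K arrays so they can grow, and a Beta(1, γ) stick-break for the newborn topic in the style of HDP direct-assignment samplers [12][13]; none of this is the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_topic_infinite(v, i, j, n_pair_topic, n_topic_word, n_topic,
                          tau, alpha, beta, V):
    """Sample z for word v under Eq. (4); counts for the current token
    are assumed to be decremented already (the "-1" terms)."""
    K = len(tau) - 1
    denom = n_pair_topic[(i, j)].sum() + alpha * tau[:K].sum()
    p = np.empty(K + 1)
    p[:K] = ((n_pair_topic[(i, j)] + alpha * tau[:K]) / denom
             * (n_topic_word[:, v] + beta) / (n_topic + V * beta))
    p[K] = alpha * tau[K] / denom / V  # mass of all unused topics
    return rng.choice(K + 1, p=p / p.sum())  # result K signals "new topic"

def grow_topic(n_pair_topic, n_topic_word, n_topic, tau, gamma):
    """Extend all count structures by one topic and split the stick tau."""
    b = rng.beta(1.0, gamma)  # stick-breaking piece for the new topic
    tau = np.concatenate([tau[:-1], [b * tau[-1]], [(1.0 - b) * tau[-1]]])
    n_topic_word = np.vstack([n_topic_word,
                              np.zeros((1, n_topic_word.shape[1]))])
    n_topic = np.append(n_topic, 0)
    for pair in n_pair_topic:
        n_pair_topic[pair] = np.append(n_pair_topic[pair], 0)
    return n_pair_topic, n_topic_word, n_topic, tau
```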
In this paper, we sample τ by simulating how the new components are created, which yields a sequence of Bernoulli trials [13]:

p(m_{ijkr} = 1) = \frac{\alpha\tau_k}{\alpha\tau_k + r - 1}, \quad r \in [1, n_{i,j}^{(k)}],\ m \in [1, M],\ k \in [1, K]    (5)

The posterior of the top-level Dirichlet process τ is then sampled via [13]

\boldsymbol{\tau} \sim \mathrm{Dirichlet}([m_1, \ldots, m_K], \gamma), \quad \text{with } m_k = \sum_{ijr} m_{ijkr}.    (6)
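The following is a small Python sketch of this τ update under the same hypothetical data layout as above: for every author pair and topic it simulates the Bernoulli trials of Eq. (5) and then draws τ from the Dirichlet posterior of Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(2)

def resample_tau(n_pair_topic, tau, alpha, gamma):
    """Resample the root distribution tau via Eqs. (5)-(6)."""
    K = len(tau) - 1
    m = np.zeros(K)
    for counts in n_pair_topic.values():  # counts[k] = n_ij^(k)
        for k in range(K):
            for r in range(1, int(counts[k]) + 1):
                # Eq. (5): trial r succeeds with probability
                # alpha*tau_k / (alpha*tau_k + r - 1); r = 1 always succeeds.
                if rng.random() < alpha * tau[k] / (alpha * tau[k] + r - 1):
                    m[k] += 1
    # Eq. (6): the extra pseudo-count gamma carries the unused mass tau_{K+1}.
    return rng.dirichlet(np.append(m, gamma))
```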
4. Experimental results and discussions
We downloaded US patents from the US Patent Office¹ on Jun 25, 2014, with the following search strategy: [ICL/F02M069/48 or TTL/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or ABST/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or ACLM/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde) or SPEC/("gas sensor" or "air sensor") and (VOC OR CO OR formaldehyde)]. The dataset contains 4760 patent abstracts and 7540 unique inventors and is used to evaluate the performance of our model.

¹ http://patft.uspto.gov/netahtml/PTO/search-adv.htm

In our experiment, the infinite coAT model automatically determines the number of topics, which turns out to be 20. Because topics consist of probabilities of words, we list 5 of these topics in Table 2, giving for each the top ten words with their probabilities and the top ten co-inventor relationships with the highest probability conditioned on that topic. The meaning of these topics is easy to summarize: for example, topic 1 is obviously about "engine", topic 4 is about "material", and so on.

Table 2. An illustration of 5 topics from the 20-topic solution for the air sensor patent dataset

Topic 1
Word           Prob.      Co-inventor                                           Prob.
engine         0.05524    (Surnilla, Gopichandra; Roth, John M.)                0.97754
fuel           0.05332    (Yasui, Yuji; Akazaki, Shusuke)                       0.97625
control        0.03385    (Lewis, Donald J.; Michelini, John O.)                0.97564
exhaust        0.03152    (Pursifull, Ross Dykstra; Surnilla, Gopichandra)      0.97451
system         0.02910    (Surnilla, Gopichandra; Smith, Stephen B.)            0.97230
combustion     0.02766    (Lewis, Donald J.; Russell, John D.)                  0.96848
air            0.02521    (Bidner, David Karl; Cunningham, Ralph Wayne)         0.96833
method         0.02086    (Glugla, Chris Paul; Baskins, Robert Sarow)           0.96025
ratio          0.01671    (Akazaki, Shusuke; Iwaki, Yoshihisa)                  0.95992
internal       0.01658    (Leone, Thomas G.; Stein, Robert A.)                  0.95740

Topic 4
Word           Prob.      Co-inventor                                           Prob.
oxide          0.02411    (Den, Tohru; Iwasaki, Tatsuya)                        0.97003
material       0.02376    (Baughman, Ray Henry; Zakhidov, Anvar Abdulahadovic)  0.95226
layer          0.02227    (Suh, Dong-Seok; Baughman, Ray Henry)                 0.95026
metal          0.02094    (Suh, Dong-Seok; Zakhidov, Anvar Abdulahadovic)       0.94531
film           0.01970    (Taylor, Earl J.; Moniz, Gary A.)                     0.93500
method         0.01897    (Ishihara, Tatsumi; Takita, Yusaku)                   0.92722
substrate      0.01799    (Godwin, Harold; Whiffen, David)                      0.91776
semiconductor  0.00765    (Shindo, Yuichiro; Takemoto, Kouichi)                 0.91718
thin           0.00949    (Itoh, Takashi; Kato, Katsuaki)                       0.91239
device         0.00905    (Ata, Masafumi; Ramm, Matthias)                       0.90789

Topic 6
Word           Prob.      Co-inventor                                           Prob.
air            0.04102    (Owen, Donald R.; Kravitz, David C.)                  0.96377
flow           0.02549    (Burbank, Jeffrey H.; Treu, Dennis M.)                0.96215
fluid          0.01982    (Brugger, James M.; Burbank, Jeffrey H.)              0.95740
system         0.01684    (Brugger, James M.; Treu, Dennis M.)                  0.94809
apparatus      0.01565    (McMillin, John R.; Strandwitz, Peter)                0.92530
pressure       0.01433    (Hess, Joseph; Muller, Myriam)                        0.92202
device         0.01163    (Brassil, John; Taylor, Michael John)                 0.92164
chamber        0.01117    (Yasuda, Yoshinobu; Nakazeki, Tsugito)                0.91810
method         0.01005    (Johnstone, III, Albert E.)                           0.91518
heat           0.00912    (Brassil, John; Schein, Douglas)                      0.90206

Topic 9
Word           Prob.      Co-inventor                                           Prob.
vehicle        0.03134    (Grubbs, Michael R.; Kenny, Garry R.)                 0.93642
electric       0.01319    (Ogawa, Gen; Senda, Satoru)                           0.87615
oil            0.01300    (Madan, Arun; Morrison, Scott)                        0.85625
motor          0.01153    (Bingham, Lynn R.; Henke, Jerome R.)                  0.85484
control        0.00812    (Pursifull, Ross Dykstra; Lewis, Donald J.)           0.84167
heating        0.00763    (Yamada, Hirohiko; Kokubo, Naoki)                     0.84167
position       0.00734    (Hjort, Klas Anders; Lindberg, Mikael Peter Erik)     0.81897
compartment    0.00724    (Bunyard, Marc R.; Holst, Peter A.)                   0.79787
assembly       0.00714    (Gibson, Alex O'Connor; Nedorezov, Felix)             0.79348
speed          0.00607    (Masuda, Satoshi; Kokubo, Naoki)                      0.78889

Topic 16
Word           Prob.      Co-inventor                                           Prob.
electron       0.00143    (Yokoyama, Yoshiaki; Kodama, Tooru)                   0.58696
soil           0.00143    (Takagi, Hiroshi; Takase, Hiromitsu)                  0.50000
elastomer      0.00143    (Boden, Mark W.; Bergquist, Robert A.)                0.36111
radiative      0.00098    (Leuthardt, Eric C.; Lord, Robert W.)                 0.32143
suppressing    0.00098    (Sato, Akira; Okamura, Masami)                        0.13636
halides        0.00098    (Shiroma, Iris; Tomasco, Allan)                       0.12500
inhalation     0.00054    (Berretta, Francine; Roberts, Joy)                    0.12500
dioxins        0.00054    (Schielinsky, Gerhard; Kubach, Hans)                  0.12500
program        0.00054    (Kamen, Dean L.; Langenfeld, Christopher C.)         0.09375
realized       0.00054    (Kubo, Yasuhiro; Ikegami, Eiji)                       0.09375

We take David Karl Bidner and Ralph Wayne Cunningham as an example and list the titles of their co-invented patents in Table 3. From Table 3, one can easily see that their co-invented patents are all about the engine, which is the meaning of topic 1. In other words, comparing Table 3 with Table 2 shows that David Karl Bidner and Ralph Wayne Cunningham share interest in topic 1 with a strength of 0.96833, which is consistent with all of their co-invented patents belonging to topic 1.

Table 3. Co-invented patents between David Karl Bidner and Ralph Wayne Cunningham

Title                                                                                 Topic belonged to
Method and system for engine control                                                  Topic 1
Particulate filter regeneration in an engine                                          Topic 1
Method and system for engine control                                                  Topic 1
Particulate filter regeneration in an engine                                          Topic 1
Particulate filter regeneration in an engine                                          Topic 1
Particulate filter regeneration during engine shutdown                                Topic 1
Particulate filter regeneration in an engine coupled to an energy conversion device   Topic 1
Method and system for engine control                                                  Topic 1
Particulate filter regeneration during engine shutdown                                Topic 1

In addition, in order to compare the performance of the coAT and infinite coAT models, we use perplexity, a standard measure for evaluating probabilistic models; the smaller the perplexity, the better the model performs.
The perplexity is defined as the reciprocal Collaboration Network under grant number 71473237; Key geometric mean of the token likelihoods in the test set D = Technologies R&D Program of Chinese 12th Five-Year Plan { wm , a m } under the coAT or infinite coAT model: (2011–2015): Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources under grant   number 2013BAG06B01; and Key Work Project of Institute of  ln P coAT ( wm | am , B)  Scientific and Technical Information of China (ISTIC): Intelligent coAT ( wm | am , B)  exp   A ( A  1)  perplexity (7) Analysis Service Platform and Application Demonstration for  Nm  m m  Multi-Source Science and Technology Literature in the Era of Big  2  Data under grant number ZD2014-7-1.Our gratitude also goes to   the anonymous reviewers for their valuable comments.  ln PicoAT ( wm | am , B)  icoAT ( wm | am , B)  exp   A ( A  1)  perplexity (8)  Nm  m m  7. REFERENCES  2  [1] C. C. Aggarwal. Social network data analytics. Springer US, where B is the set of all the prior parameters. 2011. [2] M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Physical review letters, 2001, 64(1): 016131-016131~016138. [3] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review vol. 64, pp. 016132-1~7, 2001. [4] A. Abbasi. Exploring the Relationship between Research Impact and Collaborations for Information Science. In Proceedings of the 45th Hawaii International Conference on Systems Science (HICSS-45), Hawaii, USA, 2012. [5] Z. Zhang, Q. Li, D. Zeng, et al. User community discovery from multi-relational networks. Decision Support Systems, vol. 54, no.2, pp. 870-879, 2013. Fig.2 Perplexity of the test set D [6] H. Han , S. Xu, J. Gui , X. Qiao, L. Zhu, H.Zhang. Uncovering Research Topics of Academic Communities of Fig.2 shows the results of the coAT and infinite coAT model. Scientific Collaboration Network. International Journal of The perplexity increases in proportion to the number of topics, so Distributed Sensor Networks. 2014,4,529842,1-14. the perplexity of the coAT model increases with the number of topics increasing and the perplexity of infinite coAT model stays [7] W. Chi, J. Han, Y. Jia, et al. Mining advisor-advisee [12] Y .W. Teh, M.I. Jordan, M. J. Beal, et al. Hierarchical relationships from research publication networks. KDD’ 10, Dirichlet processes. Journal of the american statistical 2010. association, 2006, 101(476). [8] B. Taskar, A. Pieter, K. Daphne. Discriminative probabilistic [13] G. Heinrich. Infinite LDA implementing the HDP with models for relational data. Eighteenth Conference (2002) on minimum code complexity. Technical note, Feb, 170, 2011. Uncertainty in Artificial Intelligence, 2002: 485-492. [14] S. Xu, Q. Shi, X. Qiao, et al. Author-Topic over Time [9] L. E. Sucar. Probabilistic Graphical Models and Their (AToT): A Dynamic Users’ Interest Model. Mobile, Applications in Intelligent Environments. In Intelligent Ubiquitous, and Intelligent Computing. Springer Berlin Environments (IE), 2012 8th International Conference on, Heidelberg, 2014: 239-245. 2012: 11-15 [15] S. Xu, Q. Shi, X. Qiao, et al. A dynamic users’ interest [10] P. Larrañaga, H. Karshenas, C. Bielza , et al. A review on discovery model with distributed inference algorithm. probabilistic graphical models in evolutionary computation. 
[Figure 2: perplexity curves of the coAT and infinite coAT models versus the number of topics.]
Fig. 2. Perplexity of the test set D.

Fig. 2 shows the results of the coAT and infinite coAT models. The perplexity of the coAT model grows with the number of topics, whereas the perplexity of the infinite coAT model stays stable at its determined number of topics, 20. It is not difficult to see that when the number of topics in the coAT model is greater than 45, the perplexity of the coAT model is larger than that of the infinite coAT model. But with the coAT model we do not know what number of topics to choose in advance, and we may well prefer a larger number such as 100. Hence, without information about the exact number of topics, the infinite coAT model outperforms the coAT model.

5. Conclusions
In this paper, we generalize the coAT model to a nonparametric counterpart, the infinite coAT model, which can estimate the number of topics. In this way, the model can not only discover the shared interests between inventors but also determine the number of topics automatically. Experiments on US patents illustrate that the infinite coAT model is feasible.

In ongoing work, we will consider the infinite coAT model over time to discover dynamic shared interests among authors, and apply this nonparametric method to other extended LDA models, such as the AToT model [14][15], to mine more useful information.

6. ACKNOWLEDGMENTS
This work is funded partially by the Natural Science Foundation of China (Research on Technology Opportunity Detection based on Paper and Patent Information Resources, grant number 71403255; Study on the Disconnected Problem of Scientific Collaboration Network, grant number 71473237); the Key Technologies R&D Program of the Chinese 12th Five-Year Plan (2011-2015) (Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources, grant number 2013BAG06B01); and the Key Work Project of the Institute of Scientific and Technical Information of China (ISTIC) (Intelligent Analysis Service Platform and Application Demonstration for Multi-Source Science and Technology Literature in the Era of Big Data, grant number ZD2014-7-1). Our gratitude also goes to the anonymous reviewers for their valuable comments.

7. REFERENCES
[1] C. C. Aggarwal. Social Network Data Analytics. Springer US, 2011.
[2] M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, vol. 64, 016131, 2001.
[3] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, vol. 64, 016132, 2001.
[4] A. Abbasi. Exploring the relationship between research impact and collaborations for Information Science. In Proceedings of the 45th Hawaii International Conference on System Sciences (HICSS-45), Hawaii, USA, 2012.
[5] Z. Zhang, Q. Li, D. Zeng, et al. User community discovery from multi-relational networks. Decision Support Systems, vol. 54, no. 2, pp. 870-879, 2013.
[6] H. Han, S. Xu, J. Gui, X. Qiao, L. Zhu, H. Zhang. Uncovering research topics of academic communities of scientific collaboration network. International Journal of Distributed Sensor Networks, vol. 2014, 529842, pp. 1-14, 2014.
[7] C. Wang, J. Han, Y. Jia, et al. Mining advisor-advisee relationships from research publication networks. In Proceedings of KDD '10, 2010.
[8] B. Taskar, P. Abbeel, D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002), pp. 485-492, 2002.
[9] L. E. Sucar. Probabilistic graphical models and their applications in intelligent environments. In Proceedings of the 8th International Conference on Intelligent Environments (IE), pp. 11-15, 2012.
[10] P. Larrañaga, H. Karshenas, C. Bielza, et al. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics, pp. 1-25, 2012.
[11] X. An, S. Xu, Y. Wen, et al. A shared interest discovery model for coauthor relationship in SNS. International Journal of Distributed Sensor Networks, vol. 2014, 2014.
[12] Y. W. Teh, M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, vol. 101, no. 476, 2006.
[13] G. Heinrich. Infinite LDA: implementing the HDP with minimum code complexity. Technical note, 2011.
[14] S. Xu, Q. Shi, X. Qiao, et al. Author-Topic over Time (AToT): a dynamic users' interest model. In Mobile, Ubiquitous, and Intelligent Computing, pp. 239-245, Springer Berlin Heidelberg, 2014.
[15] S. Xu, Q. Shi, X. Qiao, et al. A dynamic users' interest discovery model with distributed inference algorithm. International Journal of Distributed Sensor Networks, vol. 2014, 2014.