<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Infinite Coauthor Topic Model (Infinite coAT): A Nonparametric Generalization of the coAT Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Han Zhang</string-name>
          <email>zhanghan2012@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuo Xu</string-name>
          <email>xush@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaodong Qiao</string-name>
          <email>qiaox@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaofeng Zhang</string-name>
          <email>zhangzf@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongqi Han</string-name>
          <email>hanhq@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(corresponding author)</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Technology Support Center, Institute of Scientific and Technical Information of China (ISTIC)</institution>
          ,
          <addr-line>No.15 Fuxing Rd., Haidian District, Beijing 100038</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute of Scientific and Technical Information of China (ISTIC)</institution>
          ,
          <addr-line>No.15 Fuxing Rd., Haidian District, Beijing 100038</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Inspired by the hierarchical Dirichlet process (HDP), we present a generalized coAT (coauthor topic) model, also called the infinite coAT model. The infinite coAT model is a nonparametric extension of the coAT model. It can automatically determine the number of topics, each of which is regarded as a probability distribution over words, so no prior information about the number of topics is needed. To stay consistent with the coAT model, Gibbs sampling is used to infer the parameters. Finally, experimental results on a US patent dataset from the US Patent Office indicate that our infinite coAT model is feasible and efficient.</p>
      </abstract>
      <kwd-group>
        <kwd>coauthor topic (coAT) model</kwd>
        <kwd>infinite coauthor topic (infinite coAT) model</kwd>
        <kwd>stick-breaking prior</kwd>
        <kwd>hierarchical Dirichlet processes</kwd>
        <kwd>collapsed Gibbs sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]:</p>
      <p>General Terms: Algorithms, Performance</p>
      <p>Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.</p>
      <p>Published at Ceur-ws.org. Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014, Hildesheim, Oct. 7th, 2014.</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        A social network is a social structure made up of a set of
social actors (such as individuals or organizations) and a set of
dyadic ties between these actors [1] [2]. It can model various
social relationships among people, such as shared interests,
activities, backgrounds, or real-life connections. Social network
analysis is therefore very useful for measuring social
characteristics and structure [
        <xref ref-type="bibr" rid="ref1">2-6</xref>
        ]. However, most existing
methods of social network analysis consider only the links between
actors and ignore the attributes of those links. This may lead to
serious problems, for example, mistaking obviously wrong
links for correct ones merely on the basis of the number of
collaborations between authors [
        <xref ref-type="bibr" rid="ref2">7</xref>
        ]. Hence, some
methods that consider both links and their attributes have been
proposed [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">8-11</xref>
        ], including our previous work, the coauthor topic
(coAT) model, which can identify actors with similar interests
in social networks.
      </p>
      <p>
        In the coAT model, however, users have to specify the
number of topics ahead of time. In practice,
users do not know the exact number of topics and can
only guess an approximation, so how to choose the
number of topics is a frequently raised question. Inspired by
hierarchical Dirichlet processes (HDP) [
        <xref ref-type="bibr" rid="ref7">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ], in this article
we introduce a stick-breaking prior into the coAT model to propose
an infinite coAT model. The infinite coAT model can thus not
only discover the shared interests between authors, but also infer
an adequate number of topics automatically.
      </p>
      <p>
        The rest of this paper is organized as follows. Section 2
briefly introduces the coAT model and its inference.
Section 3 proposes the nonparametric coAT model and uses
Gibbs sampling to infer the model parameters. Section 4 reports
experimental evaluations on US patents, and Section 5 concludes this work.
      </p>
      <p>Notations. For convenience, we summarize the notations in Table 1.</p>
      <p>2. Coauthor Topic (coAT) model</p>
      <p>
        In this section, we briefly introduce the coAT model with a fixed number
of topics. The graphical model representation of the
coAT model is shown in Fig. 1 a). The coAT model [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ] can be viewed as the following generative process:
      </p>
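      <p>[Fig. 1 a): graphical model representation of the coAT model, with variables $a_m$, $x_{m,n}$, $y_{m,n}$, $z_{m,n}$, $w_{m,n}$ and word plate $n \in [1, N_m]$.]</p>
      <p>(1) for each topic $k \in [1, K]$:</p>
      <p>(i) draw a multinomial $\phi_k$ from Dirichlet($\beta$);</p>
      <p>(2) for each author pair $(i, j)$ with $i \in [1, A-1]$, $j \in [i+1, A]$:</p>
      <p>(i) draw a multinomial $\theta_{i,j}$ from Dirichlet($\alpha$);</p>
      <p>(3) for each word $n \in [1, N_m]$ in document $m \in [1, M]$:</p>
      <p>(i) draw an author $x_{m,n}$ uniformly from the group of authors $a_m$;</p>
      <p>(ii) draw another author $y_{m,n}$ uniformly from the group of authors $a_m \setminus x_{m,n}$;</p>
      <p>(iii) if $x_{m,n}$ &gt; $y_{m,n}$, swap $x_{m,n}$ with $y_{m,n}$;</p>
      <p>(iv) draw a topic assignment $z_{m,n}$ from multinomial($\theta_{x_{m,n}, y_{m,n}}$);</p>
      <p>(v) draw a word $w_{m,n}$ from multinomial($\phi_{z_{m,n}}$).</p>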
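      <p>The generative process above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the toy sizes K, V and A, the symmetric hyperparameter values, and the helper names dirichlet and generate_document are hypothetical and do not come from the coAT implementation.</p>
      <preformat>
```python
import random


def dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(authors, n_words, theta, phi, rng):
    """Generate (x, y, z, w) for each word of one document, following step (3)."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                                   # (i) first author
        y = rng.choice([a for a in authors if a != x])            # (ii) second author
        if x > y:                                                 # (iii) keep the pair ordered
            x, y = y, x
        z = rng.choices(range(len(phi)), weights=theta[(x, y)])[0]   # (iv) topic
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]       # (v) word
        tokens.append((x, y, z, w))
    return tokens


rng = random.Random(0)
K, V, A = 3, 6, 4                        # toy sizes (assumed for illustration)
alpha, beta = [0.5] * K, [0.1] * V       # symmetric hyperparameters (assumed)
phi = [dirichlet(beta, rng) for _ in range(K)]                    # step (1)
theta = {(i, j): dirichlet(alpha, rng)                            # step (2)
         for i in range(A - 1) for j in range(i + 1, A)}
doc = generate_document([0, 2, 3], 8, theta, phi, rng)
print(doc)
```
      </preformat>
      <p>Each generated token carries an ordered author pair, a topic and a word index, which are exactly the quantities that inference later has to recover from the observed words and author lists.</p>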
      <p>
        Based on the generative process above, the coAT model has
two sets of unknown parameters: (1) $\Phi = \{\phi_k\}_{k=1}^{K}$ and
$\Theta = \{\{\theta_{i,j}\}_{i=1}^{A-1}\}_{j=i+1}^{A}$; (2) the corresponding topic and author
pair assignments $z_{m,n}$ and $(x_{m,n}, y_{m,n})$ for each word token $w_{m,n}$.
The full conditional probability is as follows [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ]:
      </p>
      <p>$$P(z_{m,n} = k, x_{m,n} = i, y_{m,n} = j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \beta) \propto \frac{n_{i,j}^{(k)} + \alpha_k - 1}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha_k) - 1} \times \frac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v) - 1} \tag{1}$$</p>
      <p>
        where $n_k^{(v)}$ is the number of times tokens of word $v$ are assigned
to topic $k$ and $n_{i,j}^{(k)}$ is the number of times author pair $(i, j)$
is assigned to topic $k$. The parameter estimates then follow
from their definitions and Bayes' rule [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ]:
      </p>
      <p>$$\phi_{k,v} = \frac{n_k^{(v)} + \beta_v}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v)} \tag{2}$$</p>
      <p>$$\theta_{i,j,k} = \frac{n_{i,j}^{(k)} + \alpha_k}{\sum_{k=1}^{K} (n_{i,j}^{(k)} + \alpha_k)} \tag{3}$$</p>
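      <p>As a rough illustration of the full conditional in Eq. (1) and the estimates in Eqs. (2) and (3), the following Python sketch computes the sampling weights for every candidate (topic, author pair) of one token. It assumes that all counts already exclude the token being resampled, which plays the role of the "- 1" terms in Eq. (1). The function names and the toy counts are hypothetical.</p>
      <preformat>
```python
import random


def gibbs_weights(v, pairs, n_pair_topic, n_topic_word, alpha, beta):
    """Return {(k, i, j): weight} with the structure of Eq. (1).

    Counts are assumed to exclude the token currently being resampled.
    """
    weights = {}
    for (i, j) in pairs:
        pair_counts = n_pair_topic[(i, j)]
        pair_total = sum(pair_counts) + sum(alpha)
        for k in range(len(alpha)):
            word_total = sum(n_topic_word[k]) + sum(beta)
            theta_term = (pair_counts[k] + alpha[k]) / pair_total
            phi_term = (n_topic_word[k][v] + beta[v]) / word_total
            weights[(k, i, j)] = theta_term * phi_term
    return weights


def estimate_phi(n_topic_word, beta):
    """Eq. (2): topic-word estimates from the final counts."""
    return [[(row[v] + beta[v]) / (sum(row) + sum(beta)) for v in range(len(beta))]
            for row in n_topic_word]


def estimate_theta(n_pair_topic, alpha):
    """Eq. (3): author-pair topic estimates from the final counts."""
    return {pair: [(c[k] + alpha[k]) / (sum(c) + sum(alpha)) for k in range(len(alpha))]
            for pair, c in n_pair_topic.items()}


# Toy counts (assumed): K = 2 topics, V = 3 words, authors {0, 1, 2}.
alpha, beta = [0.5, 0.5], [0.1, 0.1, 0.1]
n_topic_word = [[4, 1, 0], [0, 2, 3]]
n_pair_topic = {(0, 1): [3, 0], (0, 2): [1, 2], (1, 2): [1, 3]}
weights = gibbs_weights(0, list(n_pair_topic), n_pair_topic, n_topic_word, alpha, beta)
candidates = list(weights)
k, i, j = random.Random(1).choices(candidates, weights=[weights[c] for c in candidates])[0]
phi = estimate_phi(n_topic_word, beta)
theta = estimate_theta(n_pair_topic, alpha)
print((k, i, j), phi[0], theta[(0, 1)])
```
      </preformat>
      <p>In a full sampler this step would be repeated for every token, with the counts decremented before the draw and incremented again for the newly sampled assignment.</p>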
      <p>3. Infinite Coauthor Topic (infinite coAT) model: nonparametric coAT model</p>
      <p>
        How to choose the number of topics in the coAT model is a
persistent question. The hierarchical Dirichlet process (HDP)
[
        <xref ref-type="bibr" rid="ref7">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ] provides a nonparametric method to solve this problem.
The method places a prior over a countably infinite number of
topics, of which only a few will dominate the posterior. Inspired
by this method, we propose an infinite coAT model. The
model splits the Dirichlet hyper-parameter $\alpha$ into a scalar
precision $\alpha$ and a base distribution $\tau \sim \mathrm{Dir}(\gamma/K)$ [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ]. Taking this to
the limit $K \to +\infty$, we get the root distribution for the
nonparametric coAT model. In this way, we retain the structure of
the parametric case for the Gibbs update of parameters:
      </p>
      <p>$$P(z_{m,n} = k, x_{m,n} = i, y_{m,n} = j \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \tau, \beta) \propto \begin{cases} \dfrac{n_{i,j}^{(k)} + \alpha\tau_k - 1}{\sum_{k=1}^{K} n_{i,j}^{(k)} + \alpha - 1} \times \dfrac{n_k^{(v)} + \beta_v - 1}{\sum_{v=1}^{V} (n_k^{(v)} + \beta_v) - 1}, \quad \text{if } k \text{ is an existing topic} \\ \dfrac{\alpha\tau_{K+1}}{\sum_{k=1}^{K} n_{i,j}^{(k)} + \alpha - 1} \times \dfrac{1}{V}, \quad \text{if } k \text{ is a new topic} \end{cases} \tag{4}$$</p>
      <p>Note that the sampling space has K+1 dimensions because the
root distribution $\tau$ provides K+1 possible states. We use $\alpha\tau_{K+1}/V$ to
represent all unused topics. If this state is sampled, a new topic is
created. In this way, no prior information about
the number of topics is needed, and the model determines it
automatically.</p>
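      <p>The K+1 sampling states can be illustrated with a small Python sketch. It is a simplified reading of Eq. (4) for a single, fixed author pair, assuming counts that exclude the current token; the function name infinite_gibbs_weights and all numeric values are hypothetical.</p>
      <preformat>
```python
import random


def infinite_gibbs_weights(v, pair_counts, n_topic_word, alpha, tau, beta):
    """Weights over the K existing topics plus one extra "new topic" state.

    tau has K + 1 entries; tau[K] is the mass reserved for all unused topics.
    Counts are assumed to exclude the token currently being resampled.
    """
    num_topics = len(pair_counts)
    vocab_size = len(beta)
    denom = sum(pair_counts) + alpha
    weights = []
    for k in range(num_topics):
        phi_term = (n_topic_word[k][v] + beta[v]) / (sum(n_topic_word[k]) + sum(beta))
        weights.append((pair_counts[k] + alpha * tau[k]) / denom * phi_term)
    # Extra state: alpha * tau_{K+1} / V stands in for every unused topic.
    weights.append(alpha * tau[num_topics] / denom / vocab_size)
    return weights


alpha = 1.0
beta = [0.1, 0.1, 0.1]
tau = [0.5, 0.3, 0.2]                  # K = 2 existing topics + unused-topic mass
n_topic_word = [[4, 1, 0], [0, 2, 3]]
pair_counts = [3, 0]
weights = infinite_gibbs_weights(0, pair_counts, n_topic_word, alpha, tau, beta)
k = random.Random(0).choices(range(len(weights)), weights=weights)[0]
if k == len(pair_counts):              # the extra state was drawn: open a new topic
    pair_counts.append(0)
    n_topic_word.append([0] * len(beta))
print(k, weights)
```
      </preformat>
      <p>Because the extra state always carries some probability mass, the number of topics can grow during sampling instead of being fixed in advance.</p>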
      <p>
        The inference above makes clear the importance of the root
distribution $\tau$ in the nonparametric model, so
how to sample $\tau$ is a crucial problem. In this paper, we
sample $\tau$ by simulating how new components are created,
which yields a sequence of Bernoulli trials [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ]:
      </p>
      <p>$$p(m_{ijkr} = 1) = \frac{\alpha\tau_k}{\alpha\tau_k + r - 1}, \quad r \in [1, n_{i,j}^{(k)}],\ m \in [1, M],\ k \in [1, K] \tag{5}$$</p>
      <p>
        The posterior of the top-level Dirichlet process $\tau$ is then sampled
via [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ]
      </p>
      <p>$$\tau \sim \mathrm{Dirichlet}([m_1, \ldots, m_K], \gamma) \quad \text{with} \quad m_k = \sum_{i,j,r} m_{ijkr} \tag{6}$$</p>
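      <p>The two sampling steps for the root distribution can be sketched as follows. The sketch reads Eq. (6) as a Dirichlet draw with parameter vector (m_1, ..., m_K, gamma), which is our interpretation of the notation, and it assumes every existing topic has at least one assigned token so that all Dirichlet parameters are positive. Function names and values are hypothetical.</p>
      <preformat>
```python
import random


def sample_table_counts(n_pair_topic, alpha, tau, rng):
    """Simulate the Bernoulli trials of Eq. (5) and return m_k for each topic."""
    num_topics = len(tau) - 1
    m = [0] * num_topics
    for counts in n_pair_topic.values():
        for k in range(num_topics):
            for r in range(1, counts[k] + 1):
                p = alpha * tau[k] / (alpha * tau[k] + r - 1)
                if p > rng.random():     # the first trial (r = 1) always succeeds
                    m[k] += 1
    return m


def sample_tau(m, gamma, rng):
    """Eq. (6): draw tau from a Dirichlet with parameters (m_1, ..., m_K, gamma)."""
    draws = [rng.gammavariate(a, 1.0) for a in m + [gamma]]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alpha, gamma = 1.0, 1.0                 # assumed toy hyperparameters
tau = [0.5, 0.3, 0.2]                   # current root distribution, K = 2
n_pair_topic = {(0, 1): [3, 0], (0, 2): [1, 2], (1, 2): [1, 3]}
m = sample_table_counts(n_pair_topic, alpha, tau, rng)
new_tau = sample_tau(m, gamma, rng)
print(m, new_tau)
```
      </preformat>
      <p>The last entry of the sampled vector is the mass left for unused topics, which is the quantity that feeds the new-topic state of the Gibbs update.</p>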
      <p>4. Experimental results and discussions</p>
      <p>We downloaded US patents from the US Patent Office<sup>1</sup> on Jun 25, 2014, with the
following search strategy:
ICL/F02M069/48 or TTL/("gas sensor" or "air sensor") and (VOC
OR CO OR formaldehyde) or ABST/("gas sensor" or "air sensor")
and (VOC OR CO OR formaldehyde) or ACLM/("gas sensor" or
"air sensor") and (VOC OR CO OR formaldehyde) or SPEC/("gas
sensor" or "air sensor") and (VOC OR CO OR
formaldehyde). The dataset contains 4760 patent abstracts and
7540 unique inventors, and is used to evaluate the
performance of our model.</p>
      <p>In our experiment, the infinite coAT model automatically determines the
number of topics to be 20. Because each topic is a probability
distribution over words, Table 2 lists, for 5 of these topics, the top ten
words with their probabilities and the
top ten co-inventor relationships with the highest
probability conditioned on the topic.
The meaning of these topics is easy to summarize. For
example, topic 1 is about “engine” and topic 4 is about
“material”.</p>
      <p><sup>1</sup> http://patft.uspto.gov/netahtml/PTO/search-adv.htm</p>
      <p>We take David Karl Bidner and Ralph Wayne Cunningham
as an example and list the titles of their co-invented patents in Table 3.
Table 3 shows that their co-invented patents
are all about engines, which is the meaning of topic 1. Comparing
Table 3 with Table 2, David Karl Bidner and Ralph Wayne Cunningham share
an interest in topic 1 with a strength of 0.96833, which is consistent with
their co-invented patents all being about topic 1.</p>
      <p>In addition, to compare the performance of the coAT
and infinite coAT models, we use perplexity, a standard
measure of the performance of probabilistic models.
The smaller the perplexity, the better
the model performs. The perplexity is defined as the reciprocal
geometric mean of the token likelihoods in the test set $D = \{\mathbf{w}_m, \mathbf{a}_m\}$
under the coAT or infinite coAT model:</p>
      <p>$$\mathrm{perplexity}_{\mathrm{coAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B) = \exp\left[-\frac{\ln P_{\mathrm{coAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B)}{N_m \times \frac{A_m (A_m - 1)}{2}}\right] \tag{7}$$</p>
      <p>$$\mathrm{perplexity}_{\mathrm{icoAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B) = \exp\left[-\frac{\ln P_{\mathrm{icoAT}}(\mathbf{w}_m \mid \mathbf{a}_m, B)}{N_m \times \frac{A_m (A_m - 1)}{2}}\right] \tag{8}$$</p>
      <p>where $B$ is the set of all the prior parameters.</p>
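      <p>A minimal Python sketch of the perplexity computation is given below. It assumes that the denominator in Eqs. (7) and (8) is the product of the token count N_m and the number of author pairs A_m(A_m - 1)/2, which is our reading of the formulas; the function name and the input numbers are hypothetical.</p>
      <preformat>
```python
import math


def perplexity(log_likelihood, num_tokens, num_authors):
    """exp(-ln P / (N_m * A_m * (A_m - 1) / 2)) for one test document.

    The product form of the normalizer is an assumption about Eqs. (7)-(8).
    """
    num_pairs = num_authors * (num_authors - 1) / 2
    return math.exp(-log_likelihood / (num_tokens * num_pairs))


# Hypothetical document: 100 tokens, 3 authors, log-likelihood of -1200.
value = perplexity(log_likelihood=-1200.0, num_tokens=100, num_authors=3)
print(value)
```
      </preformat>
      <p>A higher log-likelihood on held-out documents gives a lower perplexity, which is why smaller values indicate a better model.</p>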
      <p>Fig. 2 shows the perplexity of the coAT and infinite coAT models.
The perplexity of the coAT model increases with the number of
topics, whereas the perplexity of the infinite coAT model stays
constant at its automatically determined 20 topics.
When the number of topics in the coAT model is
greater than 45, the perplexity of the coAT model is higher than that
of the infinite coAT model. In the coAT model, however, we do not know
which number of topics to choose in advance, and we
tend to prefer a larger number such as 100. Hence, without
knowledge of the exact number of topics, the infinite coAT
model outperforms the coAT model.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>In this paper, we generalize the coAT model to a nonparametric
counterpart, the infinite coAT model, which can estimate the number
of topics. The model can thus not only discover the shared
interests between inventors but also determine the number of
topics automatically. The experiments on US patents
illustrate that the infinite coAT model is feasible.</p>
      <p>
        In ongoing work, we will consider the infinite coAT model over
time to discover dynamic shared interests among authors, or apply
this nonparametric method to other extended LDA models, such
as AToT models [
        <xref ref-type="bibr" rid="ref9">14</xref>
        ][
        <xref ref-type="bibr" rid="ref10">15</xref>
        ], to mine more useful information.
      </p>
    </sec>
    <sec id="sec-4">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>This work is funded partially by the Natural Science Foundation
of China: Research on Technology Opportunity Detection based
on Paper and Patent Information Resources under grant number
71403255 and Study on the Disconnected Problem of Scientific
Collaboration Network under grant number 71473237; Key
Technologies R&amp;D Program of Chinese 12th Five-Year Plan
(2011–2015): Key Technologies Research on Data Mining from
the Multiple Electric Vehicle Information Sources under grant
number 2013BAG06B01; and Key Work Project of Institute of
Scientific and Technical Information of China (ISTIC): Intelligent
Analysis Service Platform and Application Demonstration for
Multi-Source Science and Technology Literature in the Era of Big
Data under grant number ZD2014-7-1. Our gratitude also goes to
the anonymous reviewers for their valuable comments.</p>
    </sec>
    <sec id="sec-5">
      <title>7. REFERENCES</title>
      <p>[1] C. C. Aggarwal. Social network data analytics. Springer US, 2011.</p>
      <p>[2] M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, vol. 64, no. 1, 016131, 2001.</p>
      <p>[3] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, vol. 64, no. 1, 016132, 2001.</p>
      <p>[4] A. Abbasi. Exploring the Relationship between Research Impact and Collaborations for Information Science. In Proceedings of the 45th Hawaii International Conference on Systems Science (HICSS-45), Hawaii, USA, 2012.</p>
      <p>[5] Z. Zhang, Q. Li, D. Zeng, et al. User community discovery from multi-relational networks. Decision Support Systems, vol. 54, no. 2, pp. 870-879, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . Uncovering Research Topics of Academic Communities of Scientific Collaboration Network.
          <source>International Journal of Distributed Sensor Networks</source>
          .
          <year>2014</year>
          ,
          <volume>4</volume>
          ,
          <issue>529842</issue>
          ,
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          , et al.
          <article-title>Mining advisor-advisee relationships from research publication networks</article-title>
          .
          <source>KDD' 10</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Taskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          .
          <article-title>Discriminative probabilistic models for relational data</article-title>
          .
          <source>Eighteenth Conference (2002) on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2002</year>
          :
          <fpage>485</fpage>
          -
          <lpage>492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Sucar</surname>
          </string-name>
          .
          <article-title>Probabilistic Graphical Models and Their Applications in Intelligent Environments</article-title>
          .
          <source>In Intelligent Environments (IE)</source>
          ,
          <year>2012</year>
          8th International Conference on,
          <year>2012</year>
          :
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Larrañaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Karshenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bielza</surname>
          </string-name>
          , et al.
          <article-title>A review on probabilistic graphical models in evolutionary computation</article-title>
          .
          <source>Journal of Heuristics</source>
          ,
          <year>2012</year>
          :
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          , et al.
          <article-title>A Shared Interest Discovery Model for Coauthor Relationship in SNS</article-title>
          .
          <source>International Journal of Distributed Sensor Networks</source>
          ,
          <year>2014</year>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beal</surname>
          </string-name>
          , et al.
          <article-title>Hierarchical Dirichlet processes</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <year>2006</year>
          ,
          <volume>101</volume>
          (
          <issue>476</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          .
          <article-title>Infinite LDA implementing the HDP with minimum code complexity</article-title>
          .
          <source>Technical note</source>
          , Feb,
          <volume>170</volume>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , et al.
          <article-title>Author-Topic over Time (AToT): A Dynamic Users' Interest Model</article-title>
          . Mobile, Ubiquitous, and Intelligent Computing. Springer Berlin Heidelberg,
          <year>2014</year>
          :
          <fpage>239</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiao</surname>
          </string-name>
          , et al.
          <article-title>A dynamic users' interest discovery model with distributed inference algorithm</article-title>
          .
          <source>International Journal of Distributed Sensor Networks</source>
          ,
          <year>2014</year>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>