<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Incorporating Metadata into Dynamic Topic Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tianxi Li</string-name>
          <email>tianxili@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Branislav Kveton</string-name>
          <email>Branislav.Kveton@technicolor.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Wu</string-name>
          <email>yuw2@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashwin Kashyap</string-name>
          <email>Ashwin.Kashyap@technicolor.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>Stanford, CA 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Palo Alto, CA 94301</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Every day, millions of blogs and micro-blogs are posted on the Internet. These posts usually come with useful metadata, such as tags, authors, and locations. Much of this data is highly specific or personalized. Tracking the evolution of these data helps us discover trending topics and users' interests, which are key factors in recommendation and advertisement placement systems. In this paper, we use topic models to analyze topic evolution in social media corpora with the help of metadata. Specifically, we propose a flexible dynamic topic model which can easily incorporate various types of metadata. Since our model adds negligible computation cost on top of Latent Dirichlet Allocation, it can be implemented very efficiently. We test our model on both Twitter data and the NIPS paper collection. The results show that our approach provides better performance in terms of held-out likelihood, yet still retains good interpretability.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Topic evolution analysis has become increasingly important in recent years. Such analysis of social media and webpages can help people better understand how information spreads. In addition, it provides ways to understand the latent patterns of a corpus, reduce the effective dimensionality, and classify documents and data. Meanwhile, researchers have managed to fit various types of data into the topic model. For example, image segments were modeled as topics in Fei-Fei et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and user behaviors were modeled as topics in Ahmed et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In such circumstances, topic evolution gains further practical value. For example, knowing the evolution of people's behaviors can improve the performance of item recommendation and advertising strategies. In addition, dynamic feature extraction may also provide richer user profiles. In various applications, one may want to harness metadata for different purposes. When the metadata contains information useful for the topic analysis, it can help enhance the precision of the model. For instance, authorship can be used as an indicator of topics in the analysis of scientific papers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and citations can also help reveal a paper's topics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In behavior modeling, metadata such as user IDs can be used for personalized analysis.
      </p>
      <p>In this paper, we propose a topic evolution model incorporating metadata effects, named the metadata-incorporated dynamic topic model (mDTM). It is a flexible model, effective for various metadata types and evolution patterns. We demonstrate its applicability by modeling topic evolution in Twitter data, where we use hashtags as the metadata. This problem is particularly challenging because of the limited length of tweets and their non-standard, web-ish style. We then use authors as the metadata to run a dynamic author-interest analysis on the NIPS corpus. The paper is organized as follows. Section 2 gives a brief description of background and prior work. Our model is introduced in Section 3. Finally, illustrative examples of topic evolution analysis are presented in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Notations and Related</title>
    </sec>
    <sec id="sec-3">
      <title>Work</title>
      <p>
        In this paper, the corpus is denoted by D, and each document d in the corpus consists of N_d words. Each word w is an element of a vocabulary of size V. There are K different topics associated with the corpus. We assume the words in the same document are exchangeable. The case of interest is when the documents have additional metadata, which we denote by h. Assume h ∈ H, where H is the domain of h. For instance, when h is the hashtag of a tweet, H can be the set of all hashtag strings. Let h_d be the instantiation of h ∈ H at document d. With the above notation, we can define the topics to be probability distributions over the vocabulary. Let p(w|z) be the probability that word w appears when the topic is z; then topic z is represented by a V-vector corresponding to a multinomial distribution, (p(1|z), p(2|z), …, p(V|z)). Latent Dirichlet Allocation (LDA), proposed by Blei et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is one of the most popular models for topic analysis. LDA assumes the documents are generated by the following process:
(i) For each topic k = 1, …, K: draw a word distribution φ_k ~ Dir(β).
(ii) For each document d in the corpus:
(a) Draw a vector of mixture proportions θ_d ~ Dir(α).
(b) For each word position j in d:
(b1) Draw a topic for the position, z_{d,j} ~ mult(θ_d).
(b2) Draw a word for the position, w_{d,j} ~ mult(φ_{z_{d,j}}).
In this process, α is a K-vector and β is a V-vector. The θ_d are K-vectors characterizing the multinomial distribution of the topic mixture for each document d; α and β are called hyperparameters. Throughout this paper, we will use w_{d,j} and z_{d,j} to denote the word and topic at position j of document d, respectively. Dir(·) denotes the Dirichlet distribution, and mult(·) denotes the 1-trial multinomial distribution. The model structure of LDA is shown in Figure 1(a), where we use φ to represent the vectors {φ_1, …, φ_K}. In most cases, α and β are chosen to be symmetric vectors. There is work (Wallach et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) showing that LDA with asymmetric hyperparameters can outperform symmetric settings. For a K-vector α = (α_1, …, α_K), they added a Dirichlet prior α ~ Dir(γ), which connects LDA to the mixture model given by the Hierarchical Dirichlet Process (HDP), a nonparametric prior allocation process proposed in Teh et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. With the extra prior γ, the graphical structure of LDA can be represented by Figure 1(b).
      </p>
      <p>[Figure 1: (a) original LDA graphical structure; (b) asymmetric LDA with priors.]</p>
      <p>
        As mentioned in Section 1, we would like to take metadata into consideration, as in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Labeled-LDA (Ramage et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) provides another method to use metadata: it requires topics to be chosen from a subset of the label set, where the labels can incorporate certain kinds of metadata. Statistically speaking, this works like adding sparse mixture priors. In Ramage et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], labeled-LDA is used for Twitter data. However, there is no natural way to create labels for arbitrary metadata. Such models assume a specific generative process for the metadata influences, which often limits the model to certain metadata. In our model, by contrast, the impacts of metadata are modeled by empirical estimation rather than by a specific probabilistic process, which makes it generally valid.
      </p>
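      <p>As a concrete illustration, the generative process above can be written out directly. The following is a minimal sketch (not the authors' implementation), with hypothetical values for K, V, the hyperparameters, and the document lengths:</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                  # hypothetical number of topics / vocabulary size
alpha, beta = 0.1, 0.01         # symmetric hyperparameters, chosen for illustration

# (i) one word distribution phi_k ~ Dir(beta) per topic
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(n_words):
    # (ii)(a) topic mixture theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)       # (b1) topic for this position
        w = rng.choice(V, p=phi[z])      # (b2) word from that topic's distribution
        words.append(w)
    return words

corpus = [generate_document(50) for _ in range(100)]
      </preformat>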
      <p>
        On the other hand, we need dynamic models to analyze topic evolution. The dynamic topic model (DTM) proposed by Blei works well on a corpus of science papers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, its logistic Gaussian assumption is no longer conjugate to the multinomial distribution, which makes the computation inefficient. Moreover, it is an offline model that needs the entire corpus at once, and is thus not suitable for streaming data. Iwata et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use multi-scale terms to incorporate temporal relations. This method can be very complicated in some cases and therefore infeasible for large-scale datasets, but the idea of modeling the temporal relation through hyperparameters is effective in many problems. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a time-varying user model (TVUM) is proposed. It considers users' behaviors over time and connects different users through a shared sampling process. We can take a different viewpoint of TVUM: when we take each user's identity as the metadata, TVUM is actually using metadata for interest evolution. In this respect, it can be seen as a special case of, and also a starting point for, our model.
      </p>
      <p>
        In the next section, we begin from another view of the LDA model and generalize it to incorporate metadata. The inference of LDA can be done through MCMC sampling; such a sampling inference algorithm was proposed in Griffiths et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. But to understand how LDA works, we need to use the smoothed version shown in Figure 1. It is shown in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] that LDA in this case converges to an HDP mixture model as K → ∞. Thus we will introduce a few more notations and start from the HDP view of LDA. In the rest of the paper, we use the subscript d to denote variables associated with document d, the subscript k to denote variables associated with topic k, and the subscript w to denote variables associated with word w. Following this style, m_k is defined as the number of documents containing words generated from topic k, and m = (m_1, m_2, …, m_K), while n_{d,k,w} is the number of occurrences of word w in document d drawn from topic k. We further use · to denote summation over a specific index, so n_{·,k,w} is the number of occurrences of word w drawn from topic k, and n_k = (n_{·,k,1}, n_{·,k,2}, …, n_{·,k,V}). In addition, n_{d,k,·} is the number of words in d associated with topic k. When we want to discuss the variables at time t, we use the superscript t, writing x^t for the value of variable x in the model at time t; so m^t = (m^t_1, m^t_2, …, m^t_K) and n^t_k = (n^t_{·,k,1}, n^t_{·,k,2}, …, n^t_{·,k,V}). When we focus on one time slice that is clear from context, we will omit the superscript t.
      </p>
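      <p>To keep the notation concrete, these count statistics can be built directly from the topic assignments. A minimal sketch (our illustration, with hypothetical array layouts):</p>
      <preformat>
import numpy as np

def count_stats(docs, z, K, V):
    """Build m, n_dk, n_kw from assignments z[d][j], in the notation above."""
    D = len(docs)
    n_dk = np.zeros((D, K), dtype=int)   # n_{d,k,.}: words in d assigned to topic k
    n_kw = np.zeros((K, V), dtype=int)   # n_{.,k,w}: occurrences of w from topic k
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
    m = (n_dk &gt; 0).sum(axis=0)           # m_k: documents containing topic k
    return m, n_dk, n_kw
      </preformat>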
      <p>
        According to the discussion in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and the mechanism of the Gibbs sampler, we can equivalently define the LDA inference of the topic z (for each position of each document) as the stationary distribution of a Markov chain with the transition probability given in Formula (1), in which the superscript −(d, j) refers to the originally defined variables computed without the word at position j of document d; that is, w^{−(d,j)} and z^{−(d,j)} are the words and topics of the corpus except w_{d,j} and z_{d,j}, respectively.
      </p>
      <p>The interpretation of this transition probability is that the Markov chain evolves by two patterns when it arrives at new topic states in the document. (i) It chooses a topic in proportion to the existing topic distribution within the document, which tends to keep the topic of each position consistent with the document contents. (ii) With a certain probability, it chooses a topic ignoring the existing contents of the document, basing the choice instead on the popularity of topics over the entire corpus. This is a reasonable assumption in many circumstances, and we believe it explains much of the power of LDA.</p>
      <p>P(z_{d,j} = k | w^{−(d,j)}, z^{−(d,j)}) ∝ ( n^{−(d,j)}_{d,k,·} + (m_k + γ_k) / Σ_{k'} (m_{k'} + γ_{k'}) ) · P(w_{d,j} | φ_k).   (1)</p>
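      <p>Under this reading of Formula (1), a single Gibbs update for one position can be sketched as follows. This is our illustration, not the authors' code; the arrays follow the notation above:</p>
      <preformat>
import numpy as np

def sample_topic(rng, n_dk_minus, m, gamma, phi_w):
    """One draw of z_{d,j} from Formula (1).

    n_dk_minus : (K,) topic counts in document d, excluding position j
    m          : (K,) per-topic document counts
    gamma      : (K,) prior vector on the corpus-level topic mixture
    phi_w      : (K,) values of P(w_{d,j} | phi_k) for the observed word
    """
    prior = (m + gamma) / np.sum(m + gamma)   # corpus-level popularity term
    weights = (n_dk_minus + prior) * phi_w    # document term times word likelihood
    return rng.choice(len(weights), p=weights / weights.sum())
      </preformat>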
      <sec id="sec-3-1">
        <title>Generalization: mDTM</title>
        <p>Assume the corpus has metadata h. Our basic assumption is that the metadata is a good indicator of the topics of a document. For example, a tweet with hashtag "#Microsoft" is much more likely to talk about technology than about sports. Nearly all previous work involving a certain type of metadata relies on this assumption. We first define the preferences of the metadata over time as a vector function of t and h, g(h, t) = (g_1(h, t), g_2(h, t), …, g_K(h, t)). The kth element g_k(h, t) is the preference of h for topic k at time t. Since we want to build a dynamic model for topic evolution, we can learn g(h, t) and turn it into an additional impact on top of the evolutionary effects of α and β. Motivated by the definition of LDA given by (1), we define the mDTM inference at a fixed time slice to be the stationary distribution of a Markov chain with transition probability</p>
        <p>P(z_{d,j} = k | w^{−(d,j)}, z^{−(d,j)}) ∝ ( n^{−(d,j)}_{d,k,·} + g_k(h_d, t) + (m^t_k + γ^t_k) / Σ_{k'} (m^t_{k'} + γ^t_{k'}) ) · P(w_{d,j} | φ^t_k).   (2)</p>
        <p>This modification adds to (1) exactly the effects we want to incorporate. In addition, the process provided by mDTM is simple and does not incur much extra computation, as shown in Section 3.4. We focus on the case of a single metadata variable in our discussion. More than one metadata variable may be associated with the corpus; for instance, a web log might have both a timezone and a browser. In this case, we can simply model the effects as additive and estimate the function g(h, t) separately for each metadata variable; everything discussed here then carries over to the multiple-metadata case. As we will propose different evolution patterns for the parameters in later sections, we introduce here the notations f_α and f_β for the evolution functions of α and β. Taking the time effects of evolution into consideration, the entire evolution process of mDTM is as follows:
(1) t = 0: initialize the model by LDA.
(2) For t &gt; 0:
(a) Draw α^t according to the model at time t − 1, by α^t = f_α(t − 1).
(b) For each topic k = 1, …, K: draw β^t_k by β^t_k = f_β(t − 1).
(c) With the current α^t and {β^t_k, k = 1, …, K}, run the inference for the process described by Equation (2).</p>
        <p>
          We model the evolution of all the effects in separable steps, so the model can be updated as data in a new time slice arrives, which makes stream-data processing and online inference possible. It is easy to adjust mDTM for different types of metadata with different properties, as we do not have to assume specific properties of the metadata. Notice that although we generalize the Markov chain definition of LDA to mDTM, we have not shown the existence of a stationary distribution or the limiting behavior of the chain. To address this issue, we can check the mixing of the chain, so as to know whether the inference is valid; in all of our experiments such validity is observed. For details and methods concerning the mixing behavior of Markov chains, we refer to Levin et al. (2009) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>The evolution patterns f_α(t), f_β(t) and g(h, t) are addressed in Section 3.3. We then give the inference steps of mDTM in Section 3.4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evolution Patterns of mDTM</title>
        <p>Now we describe how to model g. Assume the metadata is categorical, which is the case we normally encounter in applications. Similar methods can be used to choose f_α and f_β, so we only discuss the evolution pattern for g(h, t) in detail. We use ñ^t_{k,h} to denote the number of occurrences of topic k in all documents having metadata h at time t.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Time-decay Weighted Evolution</title>
        <p>We can take g_k as a weighted average of the number of occurrences of topic k in documents with metadata h, with weights that decay over time. This represents our belief that recent information is more useful for predicting the preference. Thus,</p>
        <p>g_k(h, t) = Σ_{s&lt;t} λ^{t−s} ñ^s_{k,h},   (3)</p>
        <p>where λ is a scalar representing the influence of the metadata, with the weights λ^{t−s} decaying over time. This is a straightforward way to encode the evolution pattern, and the computation is very easy.</p>
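        <p>Equation (3) can also be maintained recursively, since g(h, t) = λ(g(h, t − 1) + ñ^{t−1}_h), which is the form one would use on a stream. A minimal sketch (our illustration; counts_by_time[s] holds the vector (ñ^s_{1,h}, …, ñ^s_{K,h})):</p>
        <preformat>
import numpy as np

lam = 0.3   # decay/influence scalar; the experiments below use 0.3 for f

def g_decay(counts_by_time, t, K):
    """Direct evaluation of Equation (3) for one metadata value h."""
    g = np.zeros(K)
    for s in range(t):                        # all time slices s before t
        g += lam ** (t - s) * counts_by_time[s]
    return g

def g_decay_step(g_prev, counts_prev):
    """Equivalent streaming update: g_t = lam * (g_{t-1} + n_tilde_{t-1})."""
    return lam * (g_prev + counts_prev)
        </preformat>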
      </sec>
      <sec id="sec-3-4">
        <title>Bayesian Posterior Evolution</title>
        <p>For each h ∈ H, we assume there is a preference vector for h, π^t_h = (π^t_{1,h}, π^t_{2,h}, …, π^t_{K,h}), lying in the (K − 1)-dimensional simplex, with π^t_{k,h} ≥ 0 for k = 1, …, K. The realization of topic choices for any h ∈ H can then be seen as (ñ^t_{1,h}, ñ^t_{2,h}, …, ñ^t_{K,h}) ~ Multinomial(ñ^t_{·,h}, π^t_h), the ñ^t_{·,h}-trial multinomial distribution, i.e., the sum of ñ^t_{·,h} independent trials from mult(π^t_h), where ñ^t_{·,h} is the total number of observations of h over the corpus. We can then take the Bayesian estimate obtained by adding a Dirichlet prior, via the process</p>
        <p>π^t_h ~ Dir(τ^{t−1} π̂^{t−1}_h),</p>
        <p>(ñ^t_{1,h}, ñ^t_{2,h}, …, ñ^t_{K,h}) ~ Multinomial(ñ^t_{·,h}, π^t_h).</p>
        <p>In this setting, we can choose the posterior expectation as the estimator, which is</p>
        <p>π̂^t_{k,h} = (ñ^t_{k,h} + τ^{t−1} π̂^{t−1}_{k,h}) / Σ_{k'} (ñ^t_{k',h} + τ^{t−1} π̂^{t−1}_{k',h}),   (4)</p>
        <p>where τ is a scalar representing the influence of the prior π̂^{t−1}_h, the Bayesian estimator from the previous time. We then let g_k(h, t) = η π̂^t_{k,h} in the process, where η is a scalar representing the influence of the metadata. This evolution pattern is very simple and smooth, and it adds almost no additional computation cost.</p>
        <p>This pattern implicitly assumes there is a hyperparameter at each time t, namely the Dirichlet parameter of π^t_h. Rather than setting it beforehand, we impute an estimate of this hyperparameter by inference from the model; this is the idea of the empirical Bayes method. In particular, notice that if no new data for h arrives after time t, the Bayesian posterior evolution keeps the preference unchanged, while the time-decay evolution gradually shrinks g to zero.</p>
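        <p>Under this reading of (4), the update is one normalized vector operation per time slice. A minimal sketch (our illustration): tau and eta are the scalar prior and metadata influences, and pi_prev is the estimate carried over from time t − 1.</p>
        <preformat>
import numpy as np

def posterior_update(counts_t, pi_prev, tau):
    """Equation (4): posterior-mean update of the preference vector of h."""
    unnorm = counts_t + tau * pi_prev
    return unnorm / unnorm.sum()

def g_bayes(counts_t, pi_prev, tau, eta):
    """g(h, t) = eta * pi_hat_t; with zero new counts this returns eta * pi_prev."""
    return eta * posterior_update(counts_t, pi_prev, tau)
        </preformat>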
      </sec>
      <sec id="sec-3-5">
        <title>Sparse Preference</title>
        <p>In certain cases, we might constrain each document to choose only a small proportion of the K topics. Our method to achieve this is to force sparsity in the topic-choosing process. We can treat the occasional appearance of most topics as noise and apply thresholding to denoise and recover the true sparse preference. Define the function S(a, ε) as a hard or soft thresholding operator, where ε is the threshold. We then process each coordinate of the vector resulting from the previous evolution pattern by S, producing a sparse vector. The soft and hard thresholding functions are defined respectively as</p>
        <p>S_soft(a, ε) = sign(a) · max{|a| − ε, 0},</p>
        <p>S_hard(a, ε) = a · I{|a| &gt; ε}.</p>
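        <p>Written out (our sketch, with thr playing the role of the threshold ε):</p>
        <preformat>
import numpy as np

def s_soft(a, thr):
    """Soft thresholding: shrink magnitudes by thr, clipping at zero."""
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def s_hard(a, thr):
    """Hard thresholding: keep entries whose magnitude exceeds thr."""
    return np.where(np.abs(a) &gt; thr, a, 0.0)
        </preformat>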
      </sec>
      <sec id="sec-3-6">
        <title>Choice of f_α and f_β</title>
        <p>Similar evolution patterns can be chosen for f_α and f_β, with the relevant variables changed according to the setting. For f_α, one can use m^t_k in place of ñ^t_{k,h} in (3) and (4). The evolution pattern of β can be derived by replacing ñ^t_{k,h} in (3) and (4) with n^t_k.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Inference</title>
        <p>As mentioned previously, mDTM can be seen as a generalization of TVUM. Suppose we take the user ID as the only metadata, which is categorical, and assume that each document belongs to a certain user; then the parameters associated with each category of the metadata in mDTM become the parameters associated with a particular user. Furthermore, suppose the documents are the browsing history of a user; then mDTM models the user's browsing behavior over time. In particular, if we use the time-decay average discussed in Section 3.3.1, the resulting model is equivalent to TVUM after some simple derivation (TVUM actually defines the evolution slightly differently, using averages at different time scales, such as daily, weekly and monthly). This connection gives a vivid example of how to transform a specific problem into the setting of mDTM.</p>
        <p>
          The time-varying relationship in mDTM is represented by a separable term, so we can treat the time-related term and the topic modeling at a fixed time separately. For a fixed time unit, the inference process by Gibbs sampling is easy to derive. Since the special case mentioned above is equivalent to TVUM, we derive the inference process by analogy to the one shown in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Suppose we have the model at the previous time t − 1; the whole inference process for time t is as follows:
(i) Update the new hyperparameters α^t and β^t for time t according to the chosen evolution patterns.
(ii) Set the starting values. We can set the initial value of α to α^t. The initial values of the counts at time t, that is m^t_k, n^t_{·,k,w} and n_{d,k,·}, can be computed after randomly assigning topics to all documents and words.
(iii) For each document d, compute g(h_d, t) according to the evolution pattern chosen in Section 3.3. Then sample the topic at each word position j by the formula
        </p>
        <p>
          P(z_{d,j} = k | w_{d,j}, others) ∝ ( n^{−(d,j)}_{d,k,·} + g_k(h_d, t) + α^t_k ) · ( n^{t,−(d,j)}_{·,k,w_{d,j}} + β^t_{k,w_{d,j}} ) / Σ^V_{w=1} ( n^{t,−(d,j)}_{·,k,w} + β^t_{k,w} ).
        </p>
        <p>(iv) Sample m^t_k from the Antoniak distribution [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for the Dirichlet process.
(v) Sample α from Dir(m^t + γ^t), and repeat (iii)-(v).</p>
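        <p>Putting steps (iii)-(v) together, one sweep of the per-time-slice sampler can be sketched as follows. This is a schematic of our reading of the procedure, not the authors' code: g_of stands for the chosen evolution pattern of Section 3.3, alpha is the vector drawn in step (v), and the word term uses the count form from step (iii).</p>
        <preformat>
import numpy as np

def gibbs_sweep(rng, docs, meta, z, n_dk, n_kw, alpha, beta, g_of):
    """One sweep of step (iii): resample z[d][j] for every position.

    docs : list of word-id lists; meta[d] is the metadata value h_d
    z    : current topic assignments, same shape as docs
    n_dk : (D, K) document-topic counts; n_kw : (K, V) topic-word counts
    alpha: (K,) vector sampled in step (v); beta : (K, V) word priors
    g_of : callable mapping h_d to the (K,) preference vector g(h_d, t)
    """
    K, _ = n_kw.shape
    for d, words in enumerate(docs):
        g = g_of(meta[d])
        for j, w in enumerate(words):
            k_old = z[d][j]
            n_dk[d, k_old] -= 1              # remove position (d, j) from counts
            n_kw[k_old, w] -= 1
            doc_term = n_dk[d] + g + alpha
            word_term = (n_kw[:, w] + beta[:, w]) / (n_kw + beta).sum(axis=1)
            p = doc_term * word_term
            k_new = rng.choice(K, p=p / p.sum())
            z[d][j] = k_new
            n_dk[d, k_new] += 1              # add it back with the new topic
            n_kw[k_new, w] += 1
        </preformat>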
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>To illustrate the model, we conducted two experiments in which metadata is used for different purposes. We first use mDTM on Twitter data for topic analysis, taking hashtags as the metadata. In the second experiment, we fit our model to the NIPS paper corpus and try to find information about specific authors, which we use as the metadata. For conciseness, we mainly discuss the former in detail, because Twitter data is special and challenging for topic analysis. In the NIPS analysis, beyond results similar to those of previous dynamic models such as DTM, we can extract the evolution patterns of authors' interests, which is the main result we present for that experiment.</p>
      <sec id="sec-4-1">
        <title>Twitter Topic Analysis</title>
      </sec>
      <sec id="sec-4-2">
        <title>Data and Model Settings</title>
        <p>
          The Twitter data in this experiment is from the paper of Yang and Leskovec [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. We use the English tweets from July 1st, 2009 to August 31st, 2009. For each of the first three days, we randomly sampled 200,000 tweets from the dataset, and around 100,000 tweets were sampled for each of the remaining days. We consider the hashtags as the metadata in the experiment. After filtering stop words and ignoring all words appearing fewer than 10 times, a vocabulary of 12,000 words was selected by TF-IDF ranking. The number of topics was fixed at 50. In mDTM, the time-decay weighted average was used for f_α and f_β, and we simply set λ = 0.3. Bayesian posterior evolution was used for the hashtags, and the soft thresholding discussed in Section 3.3.3 was used for the evolution of g(h_d, t). The scalar parameters η and τ were tuned according to the prediction performance in the first week, as discussed in Section 4.1.4. Our main interest is how topic popularity and contents change over time.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Topic Popularity Evolution</title>
        <p>As can be seen from Equation (2), all the documents with different metadata share the common term m; thus m can be interpreted as the community popularity of topics, separated from the specific preferences of the metadata. This shows which topics are more popular on Twitter. Figure 2 gives the popularity over the two months of some topics, which we labeled manually after checking the word distributions of the topics.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Topic Contents Evolution</title>
        <p>Since each topic is represented by a multinomial distribution, one can find the important words of each topic. Table 1 gives the content evolution of the topic US politics. It can be seen that obama and tcot (the word tcot stands for "top conservatives on Twitter") are very important words. However, words about "Sarah Palin" were mainly popular in July, while the words about "Kennedy" and "Glenn Beck" became popular only at the end of August; all of this roughly matches the pattern of search frequencies given by Google Trends (we do not reproduce the Google Trends results due to limited space; the search frequencies can be found at www.google.com/trends/).</p>
        <p>
          There is no standard method to evaluate dynamic topic models, so we take an approach similar to [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and show the prediction performance on held-out data. On each day, we treat the next day's data as the held-out data and measure the prediction power of the model.
        </p>
        <p>
          We compare mDTM with two LDA models without metadata, as in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], to illustrate the improvement provided by metadata modeling. (We did not compare directly with DTM, because DTM cannot be used in an online way and thus cannot serve our purpose.) In the first model, we use LDA on each day's data for inference; we call this model indLDA. The problem here is that there is no clear association between the topics of different days. The second model tries to overcome this drawback by taking all the data of previous days for inference; we call it LDA-all. By the end of the period it would require nearly two months' data, which is too much for computation, so we further subsampled the data from previous days for LDA-all towards the end of the period to keep it feasible. LDA-all does not serve our purpose either, so the main interest is in comparing indLDA and mDTM. We report the negative log-likelihood on the held-out data, computed as discussed by Wallach et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], over the beginning period (July 4th - 10th) and the end period (Aug 21st - 30th). We estimate mDTM as discussed before, but compute the negative log-likelihood ignoring the metadata of the held-out data; this gives an idea of how metadata can improve the modeling of general documents, even those without metadata. A common scalar hyperparameter appears in all three models; we tune it and the thresholding parameter to achieve the best log-likelihood in the first week. Figures 3 and 4 illustrate the results.
        </p>
        <p>
          As shown, mDTM always performs better than the other two models. This is not surprising, because mDTM has more flexible priors. It is interesting that LDA-all performs even worse than indLDA, which differs from the results of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This might be explained by the differences between Twitter data and scientific paper data: topics on Twitter change very frequently, but LDA-all takes all the previous days together, which undermines its power.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>Effects of Metadata</title>
        <p>In the Twitter analysis, the topic preference of a specific hashtag is not itself of interest; however, incorporating hashtags can improve the performance. On average, roughly 10 percent of the tweets have hashtags, but such a small proportion of metadata is able to provide an important improvement over the whole corpus, even for the tweets without hashtags. We compute the held-out log-likelihood for both the model inferred without using hashtags as metadata (called mDTM noTag) and the model mDTM using hashtags; mDTM noTag can be seen as TVUM with one user. When computing the held-out log-likelihood, we take the improvement from hashtags as the improvement in negative log-likelihood, (−loglik)_{mDTM noTag} − (−loglik)_{mDTM}. Figure 5 illustrates the improvement in negative log-likelihood on the held-out data over the period. It can be seen that, on average, incorporating hashtags as metadata does improve the performance, and this improvement tends to grow as time goes on. This might result from better estimation of most of the metadata preferences.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Running Times</title>
        <p>
          Here we present a timing comparison of mDTM and indLDA. Both were implemented in C++, running under Ubuntu 10.04 on a quad-core AMD Opteron processor with 64 GB RAM. We list the average running times (rounded) in Table 2. indLDA is the average time over 10 days (July 4th - July 13th) with 600 sampling iterations each day. mDTM-1 is mDTM running on the same data with 600 sampling iterations. Since mDTM can inherit information from the previous time slice, we found that 300 iterations (or fewer) are enough for valid inference; we use mDTM-2 to denote mDTM with 300 iterations. It can be seen that mDTM is much faster than LDA.
The previous sections show that mDTM is better than indLDA and LDA-all in generality. However, the interpretability of the topics is also of interest. Chang et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] revealed that models with better performance on held-out likelihood may have poorer interpretability. Here we use the method in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to ask humans to evaluate the interpretability. We chose July 4th (the first day after the three initial days), July 11th (one week after July 4th), August 30th (the last day) and August 23rd (one week before the end) for the experiments. However, news is difficult for people to recognize after more than a year, so we only chose 10 stable topics from each model. (To select stable topics, we count the number of words that differ in a topic's top-20 word list between two consecutive days and sum these counts over the whole period; a larger sum means the topic's word list changes frequently, and we select the 10 topics that are the most stable. The topics at different times are not associated for indLDA and LDA-all, so we connect a pair of topics on two consecutive days if their top-20 word lists have the largest overlap.) For every topic in each model, we construct a list by permuting the top 15 words of that topic together with 5 intruder words, which have low probability in that topic but high probability in some other topic. Suppose we have S subjects; then for each topic k we compute the average correct ratio (ACR)
        </p>
        <p>ACR(k) = Σ^S_{s=1} C(s, k) / (5S),</p>
        <p>
          where C(s, k) is the number of intruders correctly identified by subject s for topic k. We conducted a human evaluation experiment on Mechanical Turk with 150 subjects in total. Figure 6 shows the boxplot of the ACR distribution within each model on each day. It can be seen that mDTM does not lose much interpretability despite its better prediction performance, which differs from the observations in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We hypothesize that this is due to the impact of the metadata.
        </p>
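        <p>For concreteness, the ACR can be computed from a response matrix as below (our sketch; correct[s, k] counts the intruders, out of 5, that subject s identified for topic k):</p>
        <preformat>
import numpy as np

def acr(correct):
    """Average correct ratio per topic: ACR(k) = sum_s C(s, k) / (5 S)."""
    S = correct.shape[0]                 # number of subjects
    return correct.sum(axis=0) / (5.0 * S)
        </preformat>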
      </sec>
      <sec id="sec-4-7">
        <title>NIPS Topic Analysis</title>
        <p>In this section, we illustrate a different application of mDTM: extracting specific information about the metadata.</p>
        <p>
          The dataset for this experiment contains the text files of the NIPS conference from 1987 to 2003, from Globerson et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (the data can be found at http://ai.stanford.edu/~gal/data.html). We use only the text of the papers and take the authors as the metadata. The papers from 1987-1990 were used as the first time unit to initialize the model, and each year after that was taken as a new time unit. The preprocessing details of the data can be found on the dataset website. We further deleted all numbers and a few stop words; the resulting vocabulary has 10,005 words. The number of topics K was set to 80. Bayesian posterior evolution was used for g and f_α, while f_β was set as the time-decay weighted average with λ = 0.3. We do not use the sparse preference in this example. The remaining scalar parameter is again tuned by held-out log-likelihood as before.
        </p>
      </sec>
      <sec id="sec-4-8">
        <title>Author-Topic Interests</title>
        <p>
          As before, we can see the topic contents and popularity trends over time. Here we focus only on the special information given by the metadata in this experiment. When taking authors as metadata, an interesting result provided by mDTM is the interest profile of each author, similar to the results of [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Figure 7 shows the results given by mDTM for the author "Jordan M". The height of the red bars represents π̂^t_{k,h} from Equation (4) for h = "Jordan M", which can be interpreted as the topic interests according to past information.
        </p>
      <p>It can be seen that the author's favorite topics remained nearly the same over the three years, though the interest level in individual topics varied. Once we know the topic interests of an author, we can further investigate the contents of the author's favorite topics, which is a way to detect a user's interests that would be useful in many applications. Table 3 shows the top 10 words for four topics of significant interest to "Jordan M" in 1999, according to the results in Figure 7. We can roughly see that they are mainly about "clustering methods", "common descriptive terms", "graphical models" and "mixture models &amp; density estimation", which is a reasonable approximation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we have developed a topic evolution model that incorporates metadata impacts. Flexible
evolution patterns are proposed, which can be chosen according to the properties of the data and the application. We also demonstrate the use of the model on Twitter data and NIPS data, revealing its advantages with respect to generality, computation and interpretability. The work can be extended in many new ways. At the moment, it cannot model the birth and death of topics; one way to solve this problem is to use a general prior allocation mechanism such as HDP, and there has been work using this idea for static models. In addition, the generality and flexibility of mDTM make it possible to build other evolution patterns for the hyperparameters, which might be more suitable for specific modeling purposes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          .
          <article-title>Scalable distributed inference of dynamic user interests for behavioral targeting</article-title>
          .
          <source>In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '11</source>
          , pages
          <fpage>114</fpage>
          –
          <fpage>122</fpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Antoniak</surname>
          </string-name>
          .
          <article-title>Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>6</issue>
          ):
          <volume>1152</volume>
          –
          <fpage>1174</fpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>Dynamic topic models</article-title>
          .
          <source>In Proceedings of the 23rd international conference on Machine learning, ICML '06</source>
          , pages
          <fpage>113</fpage>
          –
          <fpage>120</fpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerrish</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          . Reading tea leaves:
          <article-title>How humans interpret topic models</article-title>
          .
          <source>In Neural Information Processing Systems</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <article-title>A bayesian hierarchical model for learning natural scene categories</article-title>
          .
          <source>CVPR</source>
          , pages
          <volume>524</volume>
          –
          <fpage>531</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chechik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          .
          <article-title>Euclidean Embedding of Co-occurrence Data</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>8</volume>
          :
          <fpage>2265</fpage>
          –
          <fpage>2295</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Steyvers</surname>
          </string-name>
          .
          <article-title>Finding scientific topics</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          ,
          <volume>101</volume>
          (
          <issue>Suppl 1</issue>
          ):
          <volume>5228</volume>
          –
          <fpage>5235</fpage>
          ,
          Apr.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Giles</surname>
          </string-name>
          .
          <article-title>Detecting topic evolution in scientific literature: how can citations help?</article-title>
          <source>In Proceedings of the 18th ACM conference on Information and knowledge management</source>
          ,
          <source>CIKM '09</source>
          , pages
          <fpage>957</fpage>
          –
          <fpage>966</fpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Iwata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sakurai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ueda</surname>
          </string-name>
          .
          <article-title>Online multiscale dynamic topic models</article-title>
          .
          <source>In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '10</source>
          , pages
          <fpage>663</fpage>
          –
          <fpage>672</fpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Levin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Wilmer</surname>
          </string-name>
          .
          <article-title>Markov chains and mixing times</article-title>
          .
          <source>American Mathematical Society</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Liebling</surname>
          </string-name>
          .
          <article-title>Characterizing microblogs with topic models</article-title>
          .
          <source>In ICWSM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Labeled LDA:
          <article-title>A supervised topic model for credit attribution in multi-labeled corpora</article-title>
          .
          <source>In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <volume>248</volume>
          –
          <fpage>256</fpage>
          ,
          Singapore
          ,
          <year>August 2009</year>
          .
          Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosen-Zvi</surname>
          </string-name>
          , T. Griffiths, M. Steyvers, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          .
          <article-title>The author-topic model for authors and documents</article-title>
          .
          <source>In Proceedings of the 20th conference on Uncertainty in artificial intelligence</source>
          ,
          <source>UAI '04</source>
          , pages
          <fpage>487</fpage>
          –
          <fpage>494</fpage>
          , Arlington, Virginia, United States,
          <year>2004</year>
          . AUAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Hierarchical Dirichlet processes</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>101</volume>
          (
          <issue>476</issue>
          ):
          <volume>1566</volume>
          –
          <fpage>1581</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Rethinking LDA: Why priors matter</article-title>
          . In Y. Bengio,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          , J. Lafferty,
          <string-name>
            <surname>C. K. I. Williams</surname>
          </string-name>
          ,
          and A. Culotta, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>22</volume>
          , pages
          <fpage>1973</fpage>
          –
          <fpage>1981</fpage>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          .
          <article-title>Evaluation methods for topic models</article-title>
          .
          <source>In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09</source>
          , pages
          <fpage>1105</fpage>
          –
          <fpage>1112</fpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Patterns of temporal variation in online media</article-title>
          .
          <source>In Proceedings of the fourth ACM international conference on Web search and data mining</source>
          ,
          <source>WSDM '11</source>
          , pages
          <fpage>177</fpage>
          –
          <fpage>186</fpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>