=Paper= {{Paper |id=None |storemode=property |title=Incorporating Metadata into Dynamic Topic Analysis |pdfUrl=https://ceur-ws.org/Vol-962/paper05.pdf |volume=Vol-962 |dblpUrl=https://dblp.org/rec/conf/uai/LiKWK12 }} ==Incorporating Metadata into Dynamic Topic Analysis== https://ceur-ws.org/Vol-962/paper05.pdf
              Incorporating Metadata into Dynamic Topic Analysis



         Tianxi Li               Branislav Kveton                        Yu Wu                  Ashwin Kashyap
    Stanford University              Technicolor                   Stanford University             Technicolor
    Stanford, CA 94305           Palo Alto, CA 94301               Stanford, CA 94305          Palo Alto, CA 94301
    tianxili@stanford.edu   Branislav.Kveton@technicolor.com       yuw2@stanford.edu     Ashwin.Kashyap@technicolor.com



                       Abstract

Every day, millions of blogs and micro-blogs are posted on the Internet. These posts usually come with useful metadata, such as tags, authors, and locations. Many of these data are highly specific or personalized. Tracking the evolution of these data helps us discover trending topics and users' interests, which are key factors in recommendation and advertisement placement systems. In this paper, we use topic models to analyze topic evolution in social media corpora with the help of metadata. Specifically, we propose a flexible dynamic topic model which can easily incorporate various types of metadata. Since our model adds negligible computational cost on top of Latent Dirichlet Allocation, it can be implemented very efficiently. We test our model on both Twitter data and the NIPS paper collection. The results show that our approach provides better performance in terms of held-out likelihood, yet still retains good interpretability.

1     Introduction

Topic evolution analysis has become increasingly important in recent years. Such analysis on social media and webpages could help people better understand how information spreads. In addition, it provides ways to understand latent patterns of a corpus, reduce effective dimensionality, and classify documents and data. Meanwhile, researchers have fit various types of data into topic models. For example, image segmentations were modeled as topics in Fei-Fei et al. [6], and user behaviors were modeled as topics in Ahmed et al. [1]. In such circumstances, topic evolution gains other practical value. For example, knowing the evolution of people's behaviors could improve the performance of item recommendation and advertising strategies. In addition, dynamic feature extraction might also provide richer user profiles.

In various applications, one might want to harness metadata for different purposes. When metadata contains useful information for the topic analysis, it can help enhance the precision of the model. For instance, authorship can be used as an indicator of the topics in scientific paper analysis [14]. Citations can also help reveal a paper's topics [9]. In behavior modeling, metadata such as user id could be used for personalized analysis.

In this paper, we propose a topic evolution model incorporating metadata effects, named the metadata-incorporated dynamic topic model (mDTM). This is a flexible model effective for various metadata types and evolution patterns. We demonstrate its applicability by modeling the topic evolution of Twitter data, where we use hashtags as the metadata. This problem is particularly challenging because of the limited length of tweets and their non-standard web style. Later we use authors as the metadata to run a dynamic author-interest analysis on the NIPS corpus.

The paper is organized as follows. Section 2 gives a brief description of background and prior work. Our model is introduced in Section 3. Finally, illustrative examples of topic evolution analysis are presented in Section 4.

2     Notations and Related Work

In this paper, the corpus is denoted by D, and each document d in the corpus consists of Nd words. Each word
w is an element in the vocabulary of size V . There are K different topics associated with the corpus. Assume the words in the same document are exchangeable. The case of interest is when the documents have other special metadata. We use h to represent the metadata. Assume h ∈ H, where H is the domain of h. For instance, when h is the hashtag of a tweet, H can be the set of all hashtag strings. Let hd be the instantiation of h ∈ H at document d. With the above notation, we can define the topics to be probability distributions over the vocabulary. Let p(w|z) be the probability that word w appears when the topic is z; then topic z is represented by a V -vector corresponding to a multinomial distribution:

               (p(1|z), p(2|z), · · · , p(V |z)).

Figure 1: Graphical structures of LDA models. (a) Original LDA graphical structure. (b) Asymmetric LDA with priors.

Latent Dirichlet Allocation (LDA), proposed by Blei et al. [4], is one of the most popular models for topic analysis. LDA assumes the documents are generated by the following process:

   (i) for each topic k = 1, · · · , K :
       Draw a word distribution by φk ∼ Dir(β).
   (ii) for each document d in the corpus :
       (a) Draw a vector of mixture proportions by θd ∼ Dir(α).
       (b) for each word position j in d :
           (b1) Draw a topic for the position by zd,j ∼ mult(θd ).
           (b2) Draw a word for the position by wd,j ∼ mult(φzd,j ).

In the process, α is a K-vector and β is a V -vector. The θd 's are K-vectors characterizing a multinomial distribution of the topic mixture for each document d. α and β are called hyperparameters. Throughout this paper, we will use wd,j and zd,j to denote the word and topic in position j of document d, respectively. Dir(α) denotes the Dirichlet distribution with parameter α, and mult(θ) denotes the 1-trial multinomial distribution. The model structure of LDA is shown in Figure 1(a), where we use Φ to represent the vectors {φ1 · · · φK }. In most cases, α and β are chosen to be symmetric vectors. There is work (Wallach et al. [16]) showing that LDA with asymmetric hyperparameters can outperform symmetric settings. For a K-vector Ω = (Ω1 , · · · , ΩK ), they added the prior α ∼ Dir(Ω), which connects LDA to the mixture model given by the Hierarchical Dirichlet Process (HDP), a nonparametric prior allocation process proposed in Teh et al. [15]. With the extra prior Ω, the graphical structure of LDA can be represented by Figure 1(b).

As mentioned in Section 1, we would like to take metadata into consideration as in [14]. Labeled-LDA (Ramage et al. [13]) provides another method to use metadata, requiring topics to be chosen from a subset of the label set, where the labels can incorporate certain kinds of metadata. Statistically speaking, this works like adding sparse mixture priors. In Ramage et al. [12], labeled-LDA is used for Twitter data. However, there is no natural way to create labels for different kinds of metadata. Such models assume a specific generative process for metadata influences, which often limits the model to certain metadata. In our model, by contrast, the impacts of metadata are modeled by empirical estimation rather than a specific probabilistic process, which makes it generally applicable.

On the other hand, we need dynamic models to analyze topic evolution. The dynamic topic model (DTM) proposed by Blei works well on the example of science papers [3]. However, its logistic Gaussian assumption is no longer conjugate to the multinomial distribution, which makes the computation inefficient. Moreover, it is an offline model that needs the entire corpus at one time, and is thus not suitable for stream data. Iwata et al. [10] use multi-scale terms to incorporate time relations. This method can be very complicated in some cases and therefore infeasible for large-scale datasets.
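To make the baseline concrete before we modify it, the LDA generative process described in this section can be sketched in a few lines of code. This is a minimal illustration with assumed inputs (function and argument names are ours), not an implementation from any of the cited papers:

```python
import numpy as np

def generate_lda_corpus(K, V, doc_lengths, alpha, beta, seed=0):
    """Sample a corpus from the LDA generative process.
    Illustrative assumptions: alpha is a K-vector, beta is a
    V-vector, and doc_lengths gives N_d for each document."""
    rng = np.random.default_rng(seed)
    # (i) for each topic k, draw a word distribution phi_k ~ Dir(beta)
    phi = rng.dirichlet(beta, size=K)              # shape (K, V)
    corpus, topics = [], []
    for N_d in doc_lengths:
        # (ii)(a) draw the topic mixture theta_d ~ Dir(alpha)
        theta = rng.dirichlet(alpha)
        # (b1) draw a topic z_{d,j} ~ mult(theta_d) for each position
        z = rng.choice(K, size=N_d, p=theta)
        # (b2) draw a word w_{d,j} ~ mult(phi_{z_{d,j}})
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        corpus.append(w)
        topics.append(z)
    return corpus, topics, phi
```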
But the idea of modeling the relations through hyperparameters is really effective in many problems. In [1], a time-varying user model (TVUM) is proposed. It considers users' behaviors over time, and connects different users by a general sampling process. Here we can take a different viewpoint of TVUM: when we take each user's identity as the metadata, TVUM is actually using metadata for interest evolution. In this respect, it can be seen as a special case and also a starting point of our model.

In the next section, we begin from another view of the LDA model, and generalize it to incorporate metadata.

3     Metadata-incorporated Dynamic Topic Model

3.1    Motivation: Define LDA via Markov Chains

The inference of LDA can be done through MCMC sampling. The sampling inference algorithm was proposed in Griffiths et al. [8]. But to understand how LDA works, we need to use the smoother version shown in Figure 1. It is shown in [15] that LDA in this case limits to an HDP mixture model as K → ∞. Thus we will introduce a few more notations and start from the HDP aspect of LDA. In the rest of the paper, we will use the subscript d to denote variables associated with document d, the subscript k to denote variables associated with topic k, and w to denote variables associated with word w. Following this style, mk is defined as the number of documents containing words generated from topic k, and m = (m1 , m2 , · · · , mK ), while nd,k,w is the number of occurrences of word w in document d drawn from topic k. We further use · to denote summation over a specific variable, so n·,k,w is the number of occurrences of word w drawn from topic k, and nk = (n·,k,1 , n·,k,2 , · · · , n·,k,V ). In addition, nd,k,· is the number of words in d which are associated with topic k. When we want to discuss the variables at time t, we use the superscript xt to represent the variable x in the model at time t, so mt = (mt1 , mt2 , · · · , mtK ) and ntk = (nt·,k,1 , nt·,k,2 , · · · , nt·,k,V ). When we focus on discussion at one time slice, which is clear in context, we will omit the superscript t.

According to the discussion in [15] and the mechanism of the Gibbs sampler, we can equivalently define the LDA inference of topic z (for each position of each document) as the stationary distribution of a Markov chain with the transition probability given in Formula (1), in which the superscript −(d, j) refers to the originally defined variables without counting position j of document d, and w−(d,j) , z−(d,j) are the words and topics of the corpus except the ones at position j in document d, that is, wd,j and zd,j respectively:

   P (zd,j = k | w−(d,j) , z−(d,j) ) ∝ ( n−(d,j)d,k,· + λ (mk + Ωk ) / Σk′ (mk′ + Ωk′ ) ) P (wd,j | φk ).   (1)

The interpretation of this transition probability is that the Markov chain evolves with the following two patterns to arrive at new topic states in the document. (i) Choose a topic proportional to the existing topic distribution within the document; this tends to keep the topic of each position consistent with the document contents. (ii) With certain probability, choose a topic ignoring the existing contents of the document, based on the popularity of topics over the entire corpus. This is a reasonable assumption in many circumstances, and we believe it could explain the power of LDA.

3.2    Generalization: mDTM

Assume the corpus has metadata h. Our basic assumption is that metadata is a good indicator of topics for each document. For example, a tweet with hashtag "#Microsoft" is much more likely to talk about technology than sports. Nearly all the previous works involving a certain type of metadata rely on this assumption. We first define the preferences of metadata over time as a vector function of t and h, g(h, t) = (g1 (h, t), g2 (h, t), · · · , gK (h, t)). The kth element gk (h, t) is the preference of h for topic k at time t. Since we want to build a dynamic model for topic evolution, we can learn g(h, t) and turn it into another impact on top of the evolutionary effects of β and Ω. Motivated by the definition of LDA given by (1), we define the mDTM inference at a fixed time slice to be the stationary distribution of a Markov chain with transition probability

   P (zd,j = k | w−(d,j) , z−(d,j) ) ∝ ( n−(d,j)d,k,· + gk (hd , t) + λ (mk + Ωtk ) / Σk′ (mk′ + Ωtk′ ) ) P (wd,j | φtk ).   (2)

This modification has exactly the effects that we want to incorporate into (1). In addition, the process provided by mDTM is simple and does not incur much extra computation, as shown in Section 3.4.
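The per-position transition probability can be made concrete with a small sketch. Argument names and array shapes are our assumptions, not the authors' implementation; setting g to zero recovers the plain LDA transition of Equation (1):

```python
import numpy as np

def mdtm_transition_probs(n_dk, g_hd, m, Omega, lam, p_w_topic):
    """Per-topic sampling probabilities for one word position z_{d,j}.
    Hypothetical arguments (all K-vectors except the scalar lam):
      n_dk      -- counts n_{d,k,.} excluding the current position
      g_hd      -- metadata preference g(h_d, t)
      m, Omega  -- corpus-level topic counts and their prior
      p_w_topic -- P(w_{d,j} | phi_k) for the current word, per topic
    """
    corpus_pop = (m + Omega) / np.sum(m + Omega)   # topic popularity over corpus
    weights = (n_dk + g_hd + lam * corpus_pop) * p_w_topic
    return weights / weights.sum()                 # normalize to sample z_{d,j}
```

With g_hd = 0 this reduces to the LDA chain of Equation (1).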
We focus on the case where there is only one metadata variable in our discussion. There might be cases where more than one metadata variable is associated with the corpus; for instance, we might have both timezone and browser for a web log. In this case, we can simply model the effects as additive and estimate the function g(h, t) separately for each metadata variable. Then everything we discuss here carries over to the multiple-metadata case. As we will propose different evolution patterns for the parameters in later sections, here we introduce the notation fΩ and fβ for the evolution functions of Ω and β. Now taking the time effects of evolution into consideration, the entire evolution process of mDTM is as follows:

   For each time t > 0 :
      (a) Draw Ωt according to the model of t − 1 by Ωt = fΩ (t − 1).
      (b) For each topic k = 1, · · · , K : draw βk by βkt = fβ (t − 1).
      (c) With the current Ωt and {βkt }K k=1 , implement the inference for the process described by equation (2).

We model the evolution of all the effects by separable steps, so the model can be updated when data in a new time slice arrives, which makes it suitable for stream data processing and online inference. It is very flexible to adjust mDTM for different types of metadata with different properties, as we do not have to assume specific properties of the metadata. Notice that though we generalize the Markov chain definition of LDA to mDTM, we have not shown the existence of the stationary distribution or the limiting behavior of the chain. To address this issue, we can check the mixing of the chain, so as to know whether the inference is valid. In all of our experiments, such validity is observed. For details and methods about the mixing behavior of Markov chains, we refer to Levin et al. (2009) [11].

The evolution patterns fΩ (t), fβ (t) and g(h, t) are addressed in Section 3.3. We then give the inference steps of mDTM in Section 3.4.

3.3    Evolution Patterns of mDTM

Now we describe how to model g. Assume the metadata is categorical, which is the case we normally encounter in applications. Similar methods can be used to choose fΩ and fβ , so we will only discuss the evolution pattern for g(h, t) in detail. We use ñtk,h to denote the number of occurrences of topic k in all documents having metadata h at time t.

3.3.1    Time-decay Weighted Evolution

We can take gk as the weighted average number of occurrences of topic k in documents with metadata h, with weights decaying over time. This represents our belief that recent information is more useful for predicting the preference. Thus,

   gk (h, t) = σ Σs<t κt−s ñsk,h ,   (3)

where σ is a scalar representing the influence of the metadata. This is a straightforward way to encode the evolution pattern, and the computation is very easy.

3.3.2    Bayesian Posterior Evolution

For each h ∈ H, we assume there is a preference vector for h, µth = (µt1,h , µt2,h , · · · , µtK,h ), which lies in the (K − 1)-dimensional simplex, with µtk,h ≥ 0 for k = 1, · · · , K. Then the realization of choosing topics for any h ∈ H can be seen as (ñt1,h , ñt2,h , · · · , ñtK,h ) ∼ Multinomial(ñth , µth ), the ñth -trial multinomial distribution, which is the sum of ñth independent trials from mult(µth ), where ñth is the total number of observations of h over the corpus. So we can take the Bayesian estimate by adding a Dirichlet prior, via the process

   µth ∼ Dir(ζ t−1 · µ̂t−1 h )

   (ñt1,h , ñt2,h , · · · , ñtK,h ) ∼ Multinomial(ñth , µth ).

In such settings, we can choose the posterior expectation as the estimator, which is

   µ̂tk,h = ( ñtk,h + ζ t−1 · µ̂t−1 k,h ) / Σk ( ñtk,h + ζ t−1 · µ̂t−1 k,h ).   (4)

Here ζ is a scalar representing the influence of the prior, and µ̂t−1 h is the Bayesian estimator from the previous time. Then we let

   gk (h, t) = σ µ̂tk,h

in the process, where σ is a scalar representing the influence of the metadata.
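The two evolution patterns for g can be sketched as follows. This is a minimal illustration with assumed data structures; the names are ours, not the paper's:

```python
import numpy as np

def time_decay_g(counts_by_time, t, kappa, sigma):
    """Equation (3): g_k(h,t) = sigma * sum_{s<t} kappa^(t-s) * n_{k,h}^s.
    counts_by_time[s] is the K-vector of topic counts for one
    metadata value h at time s (an assumed data layout)."""
    g = np.zeros_like(counts_by_time[0], dtype=float)
    for s in range(t):
        g += kappa ** (t - s) * counts_by_time[s]
    return sigma * g

def bayes_posterior_mu(counts_t, mu_prev, zeta):
    """Equation (4): posterior-expectation update of the preference
    vector mu_h^t from the counts at time t and the previous estimate."""
    unnorm = counts_t + zeta * mu_prev
    return unnorm / unnorm.sum()
```

Note that when counts_t is all zeros, bayes_posterior_mu returns mu_prev unchanged, while repeated time decay shrinks g toward zero.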
Such an evolution pattern is very simple and smooth, and it adds almost no additional computational cost.

This pattern also implicitly assumes there is a hyperparameter at each time t, namely µth . Rather than setting it beforehand, we impute the estimate for this hyperparameter by inference from the model; this is the idea of the empirical Bayes method. In particular, one may notice that if there is no new data for h after time t, Bayesian posterior evolution remains the same, while time-decay evolution gradually shrinks g to zero.

3.3.3    Sparse Preference

In certain cases, we might constrain each document to choose only a small proportion of the K topics. Our method to achieve this is to force sparsity on the topic choosing process. We can treat the occasional appearance of most of the topics as noise, then apply thresholding to denoise and obtain the true sparse preference. Define the function S(a, ε) as the hard or soft thresholding operator, where ε is the threshold. We can then process each variate of the vector resulting from the previous evolution pattern by S, resulting in a sparse vector. The soft and hard thresholding functions are defined respectively as

   Ssoft (a, ε) = sign(a) · max{|a| − ε, 0}

   Shard (a, ε) = a · I{|a| > ε}.

3.3.4    Choice of fΩ and fβ

Similar evolution patterns can be chosen for fΩ and fβ , with certain variables changed according to the settings. For fΩ , one could use mtk to replace ñtk,h in (3) and (4). The evolution pattern of β can be derived by replacing ñtk,h in (3) and (4) with ntk .

3.4    Inference

As mentioned previously, mDTM can be seen as a generalization of TVUM. Suppose we take user ID as the only metadata, which is categorical, and assume that each document belongs to a certain user ID; then the parameters associated with each category of the metadata in mDTM become the parameters associated with a particular user. Furthermore, suppose that the documents are the browsing history of a user; then mDTM models the user's browsing behavior over time. In particular, if we use the time-decay average discussed in Section 3.3.1, the resulting model is equivalent to TVUM after some simple derivation.¹ This connection gives a vivid example of how to transform a specific problem into the setting of mDTM.

¹ Actually, TVUM has a slightly different way to define the evolution of Ω, which defines the average at different scales of time, such as daily, weekly and monthly averages.

The time-varying relationship of mDTM is represented by a separable term, so we can treat the time-related term and the topic modeling for a fixed time separately. For a fixed time unit, the inference process by Gibbs sampling is easy to derive. Since the special case mentioned above is equivalent to TVUM, we derive the inference process by analogy to that shown in [1]. Suppose we have the model at the previous time t − 1; the whole process for inference at time t is as follows:

(i) Update the new hyperparameters Ωt and β t for time t according to the chosen evolution pattern.

(ii) Set the starting values. We could set the initial value of α to Ωt . The initial values for the counts at time t, that is mtk , nt·,k,w , nd,k,· , can be computed after randomly choosing topics for each document and word.

(iii) For each document d, compute g(hd , t) according to the chosen evolution pattern in Section 3.3. Then sample the topic for each word position j by the formula

   P (zd,j = k | wd,j , others) ∝ ( n−(d,j)d,k,· + gk (hd , t) + λαd,k ) · ( nt,−(d,j)·,k,wd,j + β tk,wd,j ) / ΣVw=1 ( nt,−(d,j)·,k,w + β tk,w ).

(iv) Sample mtk from the Antoniak distribution [2] for the Dirichlet process.

(v) Sample α from Dir(mt + Ωt ), and repeat (iii)-(v).

4     Experiments

To illustrate the model, we conducted two experiments in which metadata is used for different purposes. We first use mDTM on Twitter data for topic analysis, in which we take hashtags as the metadata. In the second experiment, we fit our model on the NIPS paper corpus and try to find information for specific authors,
which we use as metadata. For conciseness, we mainly discuss the former in detail, because Twitter data is special and challenging for topic analysis. In the NIPS analysis, on top of results similar to those of previous dynamic models such as DTM, we can extract authors' interest evolution patterns, which is the main result we present for that experiment.

4.1    Twitter Topic Analysis

4.1.1    Data and Model Settings

The Twitter data in the experiment is from the paper of Yang and Leskovec [18]. We use the English tweets from July 1st, 2009 to August 31st, 2009. For each of the first three days, we randomly sampled 200,000 tweets from the dataset, and around 100,000 tweets were sampled for each of the remaining days. We considered the hashtags as the metadata in the experiment. After filtering stop words and ignoring all words appearing fewer than 10 times, a vocabulary of 12,000 words was selected by TF-IDF ranking. The number of topics was fixed at 50. In mDTM, the time-decay weighted average was used for fΩ and fβ , and we simply set κ = 0.3. Bayesian posterior evolution was used for hashtags, and the soft thresholding discussed in Section 3.3.3 was used for the evolution of g(hd , t). The parameters λ and ε were tuned according to the prediction performance in the first week, which is discussed in Section 4.1.4.

Our main interest is how topic popularity and contents change over time.

4.1.2    Topic Popularity Evolution

As can be seen from Equation (2), all the documents with different metadata share the common term m; thus m can be interpreted as the community popularity of topics, separated from the specific preferences of the metadata. This shows which topics are more popular on Twitter. Figure 2 gives the popularity over the two months of some topics, which we labeled manually after checking the word distributions of the topics.

Figure 2: Topic Popularity on Twitter given by mDTM, over the period of July and August 2009.

4.1.3    Topic Contents Evolution

Since each topic is represented by a multinomial distribution, one can find the important words of each topic. Table 1 gives the content evolution of the topic US politics. It can be seen that obama and tcot² are very important words. However, words about "Sarah Palin" were mainly popular in July, while the words about "Kennedy" and "Glenn Beck" became popular only at the end of August, all of which roughly match the pattern of search frequencies given by Google Trends³.

   Table 1: Content evolution of the topic US politics
        Jul 4        Jul 27        Aug 12        Aug 30
        palin        obama         health        kennedy
        obama        palin         care          care
        sarah        tcot          obama         ted
        tcot         sarah         tcot          health
        president    president     bill          obama
        alaska       healthcare    healthcare    bill
        al           health        reform        beck
        honduras     obamas        insurance     glenn
        governor     speech        president     public
        palins       alaska        town          president

4.1.4    Generality Performance

There is no standard method to evaluate dynamic topic models, so we take an approach similar to that in [3] and measure the prediction performance on held-out data. On each day, we treat the next day's data as the held-out data and measure the prediction power of the model.

We compare mDTM with two LDA models without metadata, as in [16], to illustrate the improvement provided by metadata modeling⁴. Without metadata, in the first model, we use LDA on the data of each day for inference, and call this model indLDA. The prob-

² The word tcot represents "top conservatives on twitter".
³ We don't provide the results from Google Trends due to limited space. The search frequencies can be found at www.google.com/trends/
⁴ We didn't compare directly with DTM, because DTM cannot be used in an online way and thus cannot serve our purpose.
                                                            likelihood in the first week. Figure 3 and 4 illustrate
                                                            the results.
                                                            As is shown, mDTM always performs better than the
                                                            other two models. This is not surprising because
                                                            mDTM has more flexible priors. It is interesting that
                                                            LDA-all performs even worse than indLDA. This is
                                                            different from the results of [3]. It might be explained
                                                            by the differences between Twitter data and scientific
                                                            paper data. Twitter’s topic changes so frequently, but
                                                            LDA-all takes all the previous days together, which
                                                            undermines its power.

Figure 3: Negative log-likelihood during the early period   4.1.5   Effects of Metadata
(July 4th - 10th).
                                                            In Twitter analysis, the topic preference of a specific
                                                            hashtag is not of interests. However, incorporating
                                                            hashtags can improve the preformance. On average,
                                                            there are roughly 10 precent of the tweets having hash-
                                                            tags. But such a small proportion of metadata is able
                                                            to provide important improvement of the whole cor-
                                                            pus, even for the tweets without hashtags. We com-
                                                            pute the held-out log-likelihood, for both the model
                                                            inferred without using hashtags as metadata (called
                                                            DTM noTag) and the model mDTM using hashtags.
                                                            mDTM noTag can be seen as TVUM with one user.
                                                            Note that when compute the held-out log-likelihood.
                                                            We take the improvement of hashtags as the improve-
Figure 4: Negative log-likelihood during the end period     ment of negative log-likelihood
(Aug 21st - 30th).
                                                                    (−loglik)DTM noTag − (−loglik)mDTM .

                                                            Figure 5 illustrates the improvement of negative log-
lem here is that there is no clear association for topics
                                                            ikelihood on the held-out data over the period. It
between days. In the second one, we try to overcame
                                                            can be seen that on average, incorporating hashtags
this drawback and take all the data of previous days
                                                            as metadata does improve the performance. And this
for inference, which we call LDA-all. It would take
                                                            improvement tends to grow as time goes. This might
nearly two months’ data at the end of the period. This
                                                            results from the better estimation of most of the meta-
would be too much for computation. Thus we further
                                                            data preference.
subsampled the data from previous days for LDA-all
in the end of the period to make it feasible. LDA-all
                                                            4.1.6   Running Times
will not serve for our purpose and so the main inter-
ests would be comparing indLDA and mDTM. We re-             Here we present a comparison for timing of mDTM and
port the negative log-likelihood on the held-out data       indLDA. Both were implemented in C++, running un-
computed as discussed by Wallach et al[17] over the         der Ubuntu 10.04, with Quad core AMD Opteron Pro-
beginning period (July 4th - 10th) and the end period       cessor and 64 GB RAM. We list average running times
(Aug 21st - 30th). We estimate mDTM as discussed            (rounded) in Table 2. indLDA is the average time on
before, but computed the negative log-likelihood ignor-     10 days (July 4th - July 13th) with 600 sampling itera-
ing the metadata of the held-out data, thus this gives      tions each day. mDTM-1 is the mDTM running on the
us an idea of how metadata can improve the modeling         same data with 600 sampling iterations. Since mDTM
for general documents, even those without metadata.         could inherit information from previous time, we found
There is λ in all of the three models. We tune it and       300 iterations (or less) are enough for valid inference.
the thresholding parameter  by achieving the best log-     Thus we use mDTM-2 to denote mDTM with 300 it-
Figure 6: The human evaluation ACR for the three models. Each box is a value distribution of average correct ratios for
10 topics of the corresponding model on certain day.


                                                             and August 23th (one week before the end) for experi-
                                                             ments. However, news would be difficult for people to
                                                             recognize after more than one year, so we only chose
                                                             10 stable topics from each model5 . For every topic in
                                                             each model, we construct the list by permuting top
                                                             15 words for that topic together with 5 intruder words
                                                             which have low probability in that topic but high prob-
                                                             ability in some other topics. Suppose we have S sub-
                                                             jects, then for each topic k, we compute the average
                                                             correct ratio (ACR)

                                                                                           S
                                                                                           X
                                                                             ACR(k) =             C(s, k)/(5S),
                                                                                            s=1
Figure 5: The improvement of negative log-likelihood via
hashtags over the period. The red lines are the improve-     where C(s, k) is the number of correct intruders cho-
ment of −log(likelihood) computed by importance sam-
                                                             sen by subject s for topic k. We conducted a human
pling. The blue lines are the intervals at each estimation
point given by 2 standard deviations of the sampling.        evaluation experiment on Mechanical Turk with 150
                                                             subjects in total. Figure 6 shows the boxplot of ACR
                                                             distribution within each model on each day.
erations. It can be seen that mDTM is much faster
than LDA.                                                    It can be seen that mDTM does not lose much inter-
                                                             pretability despite its better prediction performance,
    Table 2: Running times of three different models         which is different from the observations in [5]. We hy-
           indLDA      mDTM-1       mDTM-2                   pothesize that this is due to the impacts of metadata.
          58min 41s    67min 13s    39min 24s

                                                             4.2    NIPS Topic Analysis
4.1.7   Interpretability
                                                             In this section, we illustrate a different application
The previous sections show that mDTM is better than
                                                             of mDTM, that is, to extract specific information of
indLDA and LDA-all at generality. However, the in-
                                                             metadata.
terpretability of the topics is also of interests. Chang
et al. [5] revealed that models with better performance
                                                                5
on held-out likelihood might have poor interpretabil-              We count the number of different words in the top 20 words list
                                                             on two consecutive days, and sum such numbers during the whole
ity. Here we use the method in [5] to ask humans             period together. A larger sum number means that the topic word
to evaluate the interpretability. We choose July 4th         list changes frequently. Then we select 10 topics that are the most
                                                             stable. The topics in different time are not associated for indLDA
(the first day after three initial days), July 11th (af-     and LDA-all. We connect a pair of topics between two consecutive
ter one week of July 4th), August 30th (the last day)        days if they have the most overlap on top 20 words.
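The stability-based topic selection described in footnote 5 can be sketched as follows. This is a minimal sketch, not the authors' code: the function names and the shape of the input (a per-topic list of daily top-20 word lists) are our own.

```python
def stability_score(daily_top20):
    """Sum, over consecutive days, of how many words in the day's top-20
    list were not in the previous day's list.  A smaller score means a
    more stable topic (footnote 5)."""
    return sum(len(set(curr) - set(prev))
               for prev, curr in zip(daily_top20, daily_top20[1:]))


def select_stable_topics(topic_to_daily_top20, n=10):
    """topic_to_daily_top20: dict mapping a topic id to its list of
    per-day top-20 word lists.  Returns the n most stable topic ids."""
    return sorted(topic_to_daily_top20,
                  key=lambda t: stability_score(topic_to_daily_top20[t]))[:n]
```

For indLDA and LDA-all, topics on consecutive days would first have to be matched by maximum top-20 overlap, as the footnote describes, before this scoring applies.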
4.2.1   Data and Model Settings

The dataset for this experiment contains the text files of the NIPS conference from 1987 to 2003, as used by Globerson et al. [7]⁶. We use only the text of each paper and take the authors as the metadata. The papers from 1987–1990 were used as the first time unit to initialize the model, and each year after that was taken as a new time unit. The preprocessing details of the data can be found on the website; we further deleted all numbers and a few stop words. The resulting vocabulary has 10,005 words. The number of topics K was set to 80. Bayesian posterior evolution was used for g and f_β, and f_Ω was set to a time-decay weighted average with κ = 0.3. We do not use the sparse preference in this example. The parameter λ is again tuned by log-likelihood, as before.

4.2.2   Author-Topic Interests

As before, we could examine the topic contents and popularity trends over time; here, we focus only on the special information provided by the metadata in this experiment. When taking authors as metadata, an interesting result provided by mDTM is the interests of the authors, similar to the results of [14]. Figure 7 shows the results given by mDTM for the author "Jordan M". The height of the red bars represents µ̂_{k,h} from Equation (4) for h = "Jordan M", which can be interpreted as the author's topic interests according to the past information.

Figure 7: Topic preference from mDTM over 80 topics for author "Jordan M" in 1997, 1998 and 1999; each panel plots the interest level against the topic index for one year.

It can be seen that the author's favorite topics remained nearly the same during the three years, though the interest level for individual topics varied. When we know the topic interests of an author, we can further investigate the contents of the author's favorite topics, which is a way to detect a user's interests that would be useful in many applications. Table 3 shows the top 10 words of four topics of significant interest to "Jordan M" in 1999, according to the results in Figure 7. We can roughly see that they are mainly about "clustering methods", "common descriptive terms", "graphical models" and "mixture models & density estimation", which is a reasonable approximation.

Table 3: Four significant topics for "Jordan M" selected from Figure 7 in 1999.

    Topic 60       Topic 63    Topic 75       Topic 78
    clustering     function    variational    model
    clusters       number      nodes          data
    information    figure      networks       models
    data           results     inference      parameters
    algorithm      set         gaussian       likelihood
    cluster        data        graphical      mixture
    feature        case        field          distribution
    selection      based       conditional    log
    risk           model       jordan         em
    partition      problem     node           gaussian

5   Conclusion

In this paper, we have developed a topic evolution model that incorporates the impact of metadata. Flexible evolution patterns are proposed, which can be chosen according to the properties of the data and the application. We also demonstrate the use of the model on Twitter data and NIPS data, revealing its advantages with respect to generality, computation and interpretability.

The work can be extended in many ways. At the moment, the model cannot capture the birth and death of topics; one way to solve this problem is to use a general prior allocation mechanism such as the HDP, and there has been work using this idea for static models. In addition, the generality and flexibility of mDTM make it possible to build other evolution patterns for the hyperparameters, which might be more suitable for specific modeling purposes.

⁶ Data can be found at http://ai.stanford.edu/~gal/data.html
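Section 4.2.1 sets f_Ω to a "time-decay weighted average" with κ = 0.3 without writing out the formula. One plausible reading is an exponentially decayed, normalized average of the per-period statistics, sketched below; the exact weighting scheme and the `history` layout are our assumptions, not the paper's definition.

```python
import math


def time_decay_average(history, kappa=0.3):
    """Hypothetical sketch of a time-decay weighted average.

    history: per-period statistics (e.g. topic-word counts), ordered
    oldest to newest; each entry is a list of floats of equal length.
    A period j steps in the past receives weight exp(-kappa * j), so the
    newest period gets weight 1; the result is the normalized average.
    """
    T = len(history)
    weights = [math.exp(-kappa * (T - 1 - i)) for i in range(T)]
    total = sum(weights)
    return [sum(w * h[d] for w, h in zip(weights, history)) / total
            for d in range(len(history[0]))]
```

With a single period the average reduces to that period's statistics; larger κ discounts older periods more sharply, which matches the intuition that recent Twitter or NIPS data should dominate the prior.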
References

 [1] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 114–122, New York, NY, USA, 2011. ACM.

 [2] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.

 [3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 113–120, New York, NY, USA, 2006. ACM.

 [4] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

 [5] J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems, 2009.

 [6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. CVPR, pages 524–531, 2005.

 [7] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295, 2007.

 [8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, Apr. 2004.

 [9] Q. He, B. Chen, J. Pei, B. Qiu, P. Mitra, and L. Giles. Detecting topic evolution in scientific literature: How can citations help? In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 957–966, New York, NY, USA, 2009. ACM.

[10] T. Iwata, T. Yamada, Y. Sakurai, and N. Ueda. Online multiscale dynamic topic models. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 663–672, New York, NY, USA, 2010. ACM.

[11] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.

[12] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In ICWSM, 2010.

[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, Singapore, August 2009. Association for Computational Linguistics.

[14] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pages 487–494, Arlington, Virginia, United States, 2004. AUAI Press.

[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[16] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 22, pages 1973–1981, 2009.

[17] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1105–1112, New York, NY, USA, 2009. ACM.

[18] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 177–186, New York, NY, USA, 2011. ACM.