=Paper=
{{Paper
|id=None
|storemode=property
|title=Incorporating Metadata into Dynamic Topic Analysis
|pdfUrl=https://ceur-ws.org/Vol-962/paper05.pdf
|volume=Vol-962
|dblpUrl=https://dblp.org/rec/conf/uai/LiKWK12
}}
==Incorporating Metadata into Dynamic Topic Analysis==
Tianxi Li (Stanford University, Stanford, CA 94305, tianxili@stanford.edu), Branislav Kveton (Technicolor, Palo Alto, CA 94301, Branislav.Kveton@technicolor.com), Yu Wu (Stanford University, Stanford, CA 94305, yuw2@stanford.edu), Ashwin Kashyap (Technicolor, Palo Alto, CA 94301, Ashwin.Kashyap@technicolor.com)

===Abstract===
Every day millions of blogs and micro-blogs are posted on the Internet. These posts usually come with useful metadata, such as tags, authors, locations, etc. Much of this data is highly specific or personalized. Tracking the evolution of these data helps us to discover trending topics and users' interests, which are key factors in recommendation and advertisement placement systems. In this paper, we use topic models to analyze topic evolution in social media corpora with the help of metadata. Specifically, we propose a flexible dynamic topic model which can easily incorporate various types of metadata. Since our model adds negligible computation cost on top of Latent Dirichlet Allocation, it can be implemented very efficiently. We test our model on both Twitter data and the NIPS paper collection. The results show that our approach provides better performance in terms of held-out likelihood, yet still retains good interpretability.

===1 Introduction===
Topic evolution analysis has become increasingly important in recent years. Such analysis on social media and webpages could help people understand information spreading better. In addition, it provides ways to understand latent patterns of a corpus, reduce effective dimensionality and classify documents and data. Meanwhile, researchers have managed to fit various types of data into the topic model. For example, image segmentations were modeled as topics in Fei-Fei et al. [6]. User behaviors were also modeled as topics, as in Ahmed et al. [1]. In such circumstances, topic evolution gains other practical value. For example, knowing the evolution of people's behaviors could improve the performance of item recommendations and advertising strategy. In addition, dynamic feature extraction might also provide richer user profiles.

In various applications, one might want to harness metadata for different purposes. When metadata contains useful information for the topic analysis, it can help enhance the precision of the model. For instance, authorship can be used as an indicator of the topics in the analysis of scientific papers [14]. Citations can also help reveal a paper's topics [9]. In behavior modeling, metadata such as user id could be used for personalized analysis.

In this paper, we propose a topic evolution model incorporating metadata effects, named the metadata-incorporated dynamic topic model (mDTM). This is a flexible model effective for various metadata types and evolution patterns. We demonstrate its applicability by modeling the topic evolution of Twitter data, where we use hashtags as the metadata. This problem is particularly challenging because of the limited length of tweets and their non-standard webish style. Later we use authors as the metadata to run a dynamic author-interest analysis on the NIPS corpus.

The paper is organized as follows. Section 2 gives a brief description of background and prior work. Our model is introduced in Section 3. Finally, illustrative examples of topic evolution analysis are presented in Section 4.

===2 Notations and Related Work===
In this paper, the corpus is denoted by $D$, and each document $d$ in the corpus consists of $N_d$ words. Each word $w$ is an element in the vocabulary of size $V$. There are $K$ different topics associated with the corpus. Assume the words in the same document are exchangeable. The case of interest is when the documents have other special metadata. We use $h$ to represent the metadata. Assume $h \in H$, where $H$ is the domain of $h$. For instance, when $h$ is the hashtag of a tweet, $H$ can be all the strings of hashtags.
Let $h_d$ be the instantiation of $h \in H$ at document $d$. With the above notation, we can define the topics to be probability distributions over the vocabulary. Let $p(w|z)$ be the probability that word $w$ appears when the topic is $z$; then topic $z$ is represented by a $V$-vector corresponding to a multinomial distribution: $(p(1|z), p(2|z), \cdots, p(V|z))$.

Latent Dirichlet Allocation (LDA), proposed by Blei et al. [4], is one of the most popular models for topic analysis. LDA assumes the documents are generated by the following process:

(i) For each topic $k = 1, \cdots, K$: draw a word distribution $\phi_k \sim \mathrm{Dir}(\beta)$.

(ii) For each document $d$ in the corpus:

: (a) Draw a vector of mixture proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
: (b) For each word position $j$ in $d$:
:: (b1) Draw a topic for the position by $z_{d,j} \sim \mathrm{mult}(\theta_d)$.
:: (b2) Draw a word for the position by $w_{d,j} \sim \mathrm{mult}(\phi_{z_{d,j}})$.

In the process, $\alpha$ is a $K$-vector and $\beta$ is a $V$-vector. The $\theta_d$ are $K$-vectors characterizing a multinomial distribution of the topic mixture for each document $d$. $\alpha$ and $\beta$ are called hyperparameters. Throughout this paper, we will use $w_{d,j}$ and $z_{d,j}$ to denote the word and topic in position $j$ of document $d$ respectively. $\mathrm{Dir}(\alpha)$ denotes the Dirichlet distribution with parameter $\alpha$, and $\mathrm{mult}(\theta)$ denotes the 1-trial multinomial distribution. The model structure of LDA is shown in Figure 1(a), where we use $\Phi$ to represent the vectors $\{\phi_1, \cdots, \phi_K\}$. In most cases, $\alpha$ and $\beta$ are chosen to be symmetric vectors. Wallach et al. [16] show that LDA with asymmetric hyperparameters can outperform symmetric settings. For a $K$-vector $\Omega = (\Omega_1, \cdots, \Omega_K)$, they add a prior on $\alpha$ such that $\alpha \sim \mathrm{Dir}(\Omega)$. This connects LDA to the mixture model given by the Hierarchical Dirichlet Process (HDP), a nonparametric prior allocation process proposed in Teh et al. [15]. Adding the extra prior $\Omega$, the graphical structure of LDA can be represented by Figure 1(b).

Figure 1: Graphical structures of LDA models. (a) Original LDA graphical structure. (b) Asymmetric LDA with priors.

As mentioned in Section 1, we would like to take metadata into consideration as in [14]. Labeled-LDA (Ramage et al. [13]) provides another method to use metadata, requiring topics to be chosen from a subset of the label set, where the labels can incorporate certain kinds of metadata. Statistically speaking, this works like adding sparse mixture priors. In Ramage et al. [12], labeled-LDA is used for Twitter data. However, there is no natural way to create labels for different metadata. Such models assume a specific generative process for metadata influences, which often limits the model to certain metadata. In our model, by contrast, the impacts of metadata are modeled by empirical estimation rather than a specific probabilistic process, which makes it valid generally.

On the other hand, we need dynamic models to analyze topic evolution. The dynamic topic model (DTM) proposed by Blei and Lafferty works well on the example of science papers [3]. However, its logistic Gaussian assumption is no longer conjugate to the multinomial distribution, which makes the computation inefficient. Moreover, it is an offline model that needs the entire corpus at one time, and is thus not suitable for stream data. Iwata et al. [10] use multi-scale terms to incorporate time relation. This method can be very complicated in some cases and therefore infeasible for large scale datasets. But the idea of modeling the relation by hyperparameters is really effective in many problems. In [1], a time-varying user model (TVUM) is proposed. It considers users' behaviors over time, and connects different users by the general sampling process.
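The LDA generative process listed earlier in this section can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the helper names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`) and the toy hyperparameter values are assumptions made for the example, not part of the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """1-trial multinomial draw: return index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(doc_lengths, alpha, beta, seed=0):
    """Generate (topic, word) pairs for each document following steps (i)-(ii)."""
    rng = random.Random(seed)
    num_topics = len(alpha)
    # Step (i): one word distribution phi_k ~ Dir(beta) per topic.
    phi = [sample_dirichlet(beta, rng) for _ in range(num_topics)]
    corpus = []
    for n_d in doc_lengths:
        # Step (ii)(a): topic mixture theta_d ~ Dir(alpha).
        theta = sample_dirichlet(alpha, rng)
        doc = []
        for _ in range(n_d):
            z = sample_categorical(theta, rng)    # (b1) topic for the position
            w = sample_categorical(phi[z], rng)   # (b2) word drawn from that topic
            doc.append((z, w))
        corpus.append(doc)
    return phi, corpus


# Toy run: K = 3 topics, V = 6 words, two documents (values are illustrative).
phi, corpus = generate_corpus([5, 3], alpha=[0.5] * 3, beta=[0.5] * 6, seed=1)
```

Inference reverses this process: given only the words, it recovers plausible values for the topic assignments and the distributions behind them.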
Here we can take a different viewpoint of TVUM. Note that when we take each user's identity as the metadata, TVUM is actually using metadata for interest evolution. In this respect, it can be seen as a special case and also a starting point of our model.

In the next section, we begin from another view of the LDA model, and generalize it to incorporate metadata.

===3 Metadata-incorporated Dynamic Topic Model===

====3.1 Motivation: Define LDA via Markov Chains====
The inference of LDA can be done through MCMC sampling. The sampling inference algorithm was proposed in Griffiths et al. [8]. But to understand how LDA works, we need to use the smoother version shown in Figure 1. It is shown in [15] that LDA in this case tends to an HDP mixture model as $K \to \infty$. Thus we will introduce a few more notations and start from the HDP aspect of LDA. In the rest of the paper, we will use subscript $d$ to denote variables associated with document $d$, subscript $k$ to denote variables associated with topic $k$, and $w$ to denote variables associated with word $w$. Following this style, $m_k$ is defined as the number of documents containing words generated from topic $k$ and $m = (m_1, m_2, \cdots, m_K)$, while $n_{d,k,w}$ is the number of occurrences of word $w$ in document $d$ that come from topic $k$. We further use $\cdot$ to denote summation over a specific variable, so $n_{\cdot,k,w}$ is the number of occurrences of word $w$ drawn from topic $k$, and $n_k = (n_{\cdot,k,1}, n_{\cdot,k,2}, \cdots, n_{\cdot,k,V})$. In addition, $n_{d,k,\cdot}$ is the number of words in $d$ which are associated with topic $k$. When we want to discuss the variables at time $t$, we use the superscript $x^t$ to represent the variable $x$ in the model of time $t$. So we have $m^t = (m^t_1, m^t_2, \cdots, m^t_K)$ and $n^t_k = (n^t_{\cdot,k,1}, n^t_{\cdot,k,2}, \cdots, n^t_{\cdot,k,V})$. When we focus on a single time slice, which is clear in context, we will omit the superscript $t$.

According to the discussion in [15] and the mechanism of the Gibbs sampler, we can equivalently define the LDA inference of topic $z$ (for each position of each document) as the stationary distribution of a Markov chain with the transition probability given in Formula (1), in which the superscript $-(d,j)$ refers to the originally defined variables without considering position $j$ of document $d$, and $w^{-(d,j)}$, $z^{-(d,j)}$ are the words and topics of the corpus except the ones at position $j$ in document $d$, that is, $w_{d,j}$ and $z_{d,j}$ respectively.

$$P(z_{d,j} = k \mid w^{-(d,j)}, z^{-(d,j)}) \propto \left( n^{-(d,j)}_{d,k,\cdot} + \lambda \frac{m_k + \Omega_k}{\sum_{k'} (m_{k'} + \Omega_{k'})} \right) P(w_{d,j} \mid \phi_k). \qquad (1)$$

The interpretation of this transition probability is that the Markov chain evolves with the following two patterns to arrive at new topic states in the document. (i) Choose a topic proportional to the existing topic distribution within the document. This means it tends to keep the topic of each position consistent with the document contents. (ii) With a certain probability, it might choose a topic ignoring the existing contents of the document. However, this choice is based on the popularity of topics over the entire corpus. This is a reasonable assumption in many circumstances, and we believe this could explain the power of LDA.

====3.2 Generalization: mDTM====
Assume the corpus has metadata $h$. Our basic assumption is that metadata is a good indicator of topics for each document. For example, a tweet with hashtag "#Microsoft" is much more likely to talk about technology than sports. Nearly all previous works involving a certain type of metadata rely on this assumption. We first define the preferences of metadata over time as a vector function of $t$ and $h$, $g(h, t) = (g_1(h, t), g_2(h, t), \cdots, g_K(h, t))$. The $k$th element $g_k(h, t)$ is the preference of $h$ for topic $k$ at time $t$. Since we want to build a dynamic model for topic evolution, we can learn $g(h, t)$, and turn it into another impact on top of the evolutionary effects of $\beta$ and $\Omega$. Motivated by the definition of LDA given by (1), we define the mDTM inference at a fixed time slice to be the stationary distribution of a Markov chain with transition probability

$$P(z_{d,j} = k \mid w^{-(d,j)}, z^{-(d,j)}) \propto \left( n^{-(d,j)}_{d,k,\cdot} + g_k(h_d, t) + \lambda \frac{m_k + \Omega^t_k}{\sum_{k'} (m_{k'} + \Omega^t_{k'})} \right) P(w_{d,j} \mid \phi^t_k). \qquad (2)$$

The modification we make has exactly the effects that we want to incorporate into (1). In addition, the process provided by mDTM is simple and does not incur much computation, as shown in Section 3.4. We only focus on the case where there is only one metadata variable in our discussion. There might be cases where more than one metadata variable is associated with the corpus. For instance, we might have timezone and browser for a web log. In this case, we can simply model the effects as additive and estimate the function $g(h, t)$ separately for each metadata variable. Then everything we discuss here could be used for the case of multiple metadata variables. As we will propose different evolution patterns for the parameters in later sections, here we introduce the notation $f_\Omega$ and $f_\beta$ for the evolution functions of $\Omega$ and $\beta$. Taking the time effects of evolution into consideration, the entire evolution process of mDTM is as follows.

For $t > 0$:

: (a) Draw $\Omega^t$ according to the model of $t - 1$ by $\Omega^t = f_\Omega(t - 1)$.
: (b) For each topic $k = 1, \cdots, K$: draw $\beta_k$ by $\beta^t_k = f_\beta(t - 1)$.
: (c) With the current $\Omega^t$ and $\{\beta^t_k\}_{k=1}^K$, implement the inference for the process described by equation (2).

We model the evolution of all the effects by separable steps, so the model can be updated when data in a new time slice arrives, which makes stream data processing and online inference possible. It is very flexible to adjust mDTM for different types of metadata with different properties, as we do not have to assume specific properties of the metadata. Notice that though we generalize the Markov chain definition of LDA to mDTM, we have not shown the existence of the stationary distribution or the limiting behavior of the chain. To address this issue, we can check mixing of the chain, so as to know if the inference is valid. In all of our experiments, such validity is observed. For details and methods about mixing behavior of Markov chains, we refer to Levin et al. [11].

The evolution patterns $f_\Omega(t)$, $f_\beta(t)$ and $g(h, t)$ are addressed in Section 3.3. Then we give the inference steps of mDTM in Section 3.4.

====3.3 Evolution Patterns of mDTM====
Now we describe how to model $g$. Assume metadata is categorical, which is the case we normally encounter in applications. Similar methods can be used to choose $f_\Omega$ and $f_\beta$, so we will only discuss the evolution pattern for $g(h, t)$ in detail. We use $\tilde{n}^t_{k,h}$ to denote the number of times topic $k$ occurs in all documents having metadata $h$ at time $t$.

=====3.3.1 Time-decay Weighted Evolution=====
We can just take $g_k$ as the weighted average number of occurrences of topic $k$ in documents with metadata $h$, using weights that decay over time. This represents our belief that recent information is more useful to predict the preference. Thus,

$$g_k(h, t) = \sigma \sum_{s < t} \kappa^{t-s} \tilde{n}^s_{k,h}, \qquad (3)$$

where $\kappa$ controls how quickly the weights decay over time and $\sigma$, as in Section 3.3.2, is a scalar representing the influence of the metadata. This is a straightforward way to encode the evolution pattern, and the computation is very easy.

=====3.3.2 Bayesian Posterior Evolution=====
For each $h \in H$, we assume there is a preference vector $\mu^t_h = (\mu^t_{1,h}, \mu^t_{2,h}, \cdots, \mu^t_{K,h})$ in the $(K-1)$-dimensional simplex, with $\mu^t_{k,h} \ge 0$ for $k = 1, \cdots, K$. Then the realization of choosing topics for any $h \in H$ can be seen as $(\tilde{n}^t_{1,h}, \tilde{n}^t_{2,h}, \cdots, \tilde{n}^t_{K,h}) \sim \mathrm{Multinomial}(\tilde{n}^t_h, \mu^t_h)$, the $\tilde{n}^t_h$-trial multinomial distribution which is the sum of $\tilde{n}^t_h$ independent trials from $\mathrm{mult}(\mu^t_h)$, where $\tilde{n}^t_h$ is the total number of observations of $h$ over the corpus. So we can take the Bayesian estimate by adding a Dirichlet prior through the process:

$$\mu^t_h \sim \mathrm{Dir}(\zeta^{t-1} \cdot \hat{\mu}^{t-1}_h)$$

$$(\tilde{n}^t_{1,h}, \tilde{n}^t_{2,h}, \cdots, \tilde{n}^t_{K,h}) \sim \mathrm{Multinomial}(\tilde{n}^t_h, \mu^t_h)$$

In such settings, we can choose the posterior expectation as the estimator, which is

$$\hat{\mu}^t_{k,h} = \frac{\tilde{n}^t_{k,h} + \zeta^{t-1} \cdot \hat{\mu}^{t-1}_{k,h}}{\sum_{k'} \left( \tilde{n}^t_{k',h} + \zeta^{t-1} \cdot \hat{\mu}^{t-1}_{k',h} \right)}. \qquad (4)$$

$\zeta$ is a scalar representing the influence of the prior, which is the Bayesian estimator from the previous time. Then let $g_k(h, t) = \sigma \hat{\mu}^t_{k,h}$ in the process, where $\sigma$ is a scalar representing the influence of the metadata. Such an evolution pattern is very simple and smooth and it adds almost no additional computation cost.

This pattern also assumes there is a hyperparameter at each time $t$, which is $\mu^t_h$. Rather than setting it beforehand, we impute the estimate for such hyperparameters by inference from the model. This is the idea of the empirical Bayes method. In particular, one could notice that if there is no new data for $h$ after time $t$, Bayesian posterior evolution would remain the same, while the time-decay evolution gradually shrinks $g$ to zero.
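The two evolution patterns of Sections 3.3.1 and 3.3.2 reduce to a few arithmetic operations on per-topic counts. The sketch below is a minimal illustration of Equations (3) and (4) for a single metadata value $h$; the function names and the toy counts are assumptions for the example, and $\zeta$ is treated as a plain scalar prior strength.

```python
def time_decay_preference(history, kappa, sigma):
    """Equation (3): g_k(h, t) = sigma * sum_{s<t} kappa**(t - s) * n_tilde[s][k].

    `history` holds one count vector per past time slice for a fixed metadata
    value h, ordered from the oldest slice (s = 0) to the latest (s = t - 1).
    """
    t = len(history)
    num_topics = len(history[0])
    g = [0.0] * num_topics
    for s, counts in enumerate(history):
        weight = kappa ** (t - s)          # older slices get smaller weights
        for k in range(num_topics):
            g[k] += weight * counts[k]
    return [sigma * value for value in g]


def posterior_preference(counts, prev_mu, zeta):
    """Equation (4): posterior mean of mu under a Dir(zeta * prev_mu) prior."""
    pseudo = [c + zeta * m for c, m in zip(counts, prev_mu)]
    total = sum(pseudo)
    return [p / total for p in pseudo]


# Toy values (illustrative only): two topics, two past time slices.
g = time_decay_preference([[4, 0], [0, 2]], kappa=0.5, sigma=1.0)
mu = posterior_preference([3, 1], prev_mu=[0.5, 0.5], zeta=4.0)
# With no new observations the posterior estimate stays at the previous one,
# which matches the remark at the end of Section 3.3.2.
unchanged = posterior_preference([0, 0], prev_mu=[0.5, 0.5], zeta=4.0)
```

In the time-decay pattern an additional empty slice multiplies every past weight by $\kappa$, so $g$ shrinks toward zero, whereas the posterior estimate is left untouched.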
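=====3.3.3 Sparse Preference=====
In certain cases, we might constrain each document to only choose a small proportion of the $K$ topics. Our method to achieve this goal is to force sparsity on the topic choosing process. We can take the occasional appearance of most of the topics as noise, then apply thresholding to denoise and get the true sparse preference. Define the function $S(a, \epsilon)$ as a hard or soft thresholding operator, where $\epsilon$ is the threshold. Then we can process each component of the vector resulting from the previous evolution pattern by $S$, resulting in a sparse vector. The soft and hard thresholding functions are defined respectively as

$$S_{\mathrm{soft}}(a, \epsilon) = \mathrm{sign}(a) \cdot \max\{|a| - \epsilon, 0\}$$

$$S_{\mathrm{hard}}(a, \epsilon) = \mathrm{sign}(a) \cdot I\{|a| > \epsilon\}$$

=====3.3.4 Choice of $f_\Omega$ and $f_\beta$=====
Similar evolution patterns can be chosen for $f_\Omega$ and $f_\beta$, with certain variables changed according to the settings. For $f_\Omega$, one could use $m^t_k$ to replace $\tilde{n}^t_{k,h}$ in (3) and (4). The evolution pattern of $\beta$ can be derived by replacing $\tilde{n}^t_{k,h}$ in (3) and (4) by $n^t_k$.

====3.4 Inference====
As mentioned previously, mDTM can be seen as a generalization of TVUM. Suppose now we take user-ID as the only metadata, which is categorical, and assume that each document belongs to a certain user-ID; then the parameters associated with each category of the metadata in mDTM become the parameters associated with a particular user. Furthermore, suppose that the documents are the browsing history of a user; then mDTM will be modeling the user's browsing behavior over time. In particular, if we use the time-decay average discussed in Section 3.3.1, the resulting model is equivalent to TVUM after some simple derivation.<sup>1</sup> This connection gives a vivid example of how to transform a specific problem into the settings of mDTM.

<sup>1</sup> Actually, TVUM has a slightly different way to define the evolution of $\Omega$, which defines the average in different scales of time, such as daily, weekly and monthly average.

The time-varying relationship of mDTM can be represented by a separable term, thus we can incorporate the time-related term and the topic modeling for a fixed time separately. For a fixed time unit, the inference process by Gibbs sampling is easy to derive. Since the special case mentioned before is equivalent to TVUM, we derive the inference process by analogy to that shown in [1]. Suppose now we have the model at the previous time $t - 1$; the whole process for inference at $t$ is as follows:

(i) Update the new hyperparameters $\Omega^t$ and $\beta^t$ for time $t$ according to the chosen evolution pattern.

(ii) Set the starting values. We could set the initial value of $\alpha$ as $\Omega^t$. The initial values for the counts at time $t$, that is $m^t_k$, $n^t_{\cdot,k,w}$, $n_{d,k,\cdot}$, can be computed after randomly choosing topics for each document and word.

(iii) For each document $d$, compute $g(h_d, t)$ according to the chosen evolution pattern in Section 3.3. Then sample the topic for each word position $j$ by the formula

$$P(z_{d,j} = k \mid w_{d,j}, \text{others}) \propto \left( n^{-(d,j)}_{d,k,\cdot} + g_k(h_d, t) + \lambda \alpha_{d,k} \right) \cdot \frac{n^{t,-(d,j)}_{\cdot,k,w_{d,j}} + \beta^t_{k,w_{d,j}}}{\sum_{w=1}^{V} \left( n^{t,-(d,j)}_{\cdot,k,w} + \beta^t_{k,w} \right)}.$$

(iv) Sample $m^t_k$ from the Antoniak distribution [2] for the Dirichlet process.

(v) Sample $\alpha$ from $\mathrm{Dir}(m^t + \Omega^t)$, and repeat (iii)-(v).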
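The thresholding operators of Section 3.3.3 and the sampling distribution of step (iii) can be sketched together as follows. This is a hedged illustration rather than the authors' implementation: the function names and argument layout are hypothetical, all counts are assumed to already exclude position $(d, j)$, and $\alpha$ is assumed to be a single $K$-vector shared by the documents, as step (v) suggests.

```python
def soft_threshold(a, eps):
    """S_soft(a, eps) = sign(a) * max(|a| - eps, 0)."""
    sign = (a > 0) - (a < 0)
    return sign * max(abs(a) - eps, 0.0)


def hard_threshold(a, eps):
    """S_hard(a, eps) = sign(a) * I{|a| > eps}, as written in Section 3.3.3.

    Note: a common alternative convention keeps the value, a * I{|a| > eps}.
    """
    sign = (a > 0) - (a < 0)
    return float(sign) if abs(a) > eps else 0.0


def topic_conditional(n_dk, g, alpha, lam, n_kw, beta_kw, n_k_total, beta_k_total):
    """Normalised P(z_{d,j} = k | w_{d,j}, others) from step (iii).

    n_dk[k]         words in document d assigned to topic k
    g[k]            metadata preference g_k(h_d, t)
    alpha[k]        current draw of alpha (assumed shared across documents)
    n_kw[k]         occurrences of the current word under topic k
    beta_kw[k]      beta^t_{k,w} for the current word
    n_k_total[k]    total words assigned to topic k
    beta_k_total[k] sum of beta^t_{k,w} over the vocabulary
    """
    weights = []
    for k in range(len(n_dk)):
        doc_part = n_dk[k] + g[k] + lam * alpha[k]
        word_part = (n_kw[k] + beta_kw[k]) / (n_k_total[k] + beta_k_total[k])
        weights.append(doc_part * word_part)
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy example with K = 2 topics (all numbers are illustrative).
g_sparse = [soft_threshold(x, 0.5) for x in [0.3, 1.5]]
probs = topic_conditional(
    n_dk=[1, 0], g=g_sparse, alpha=[0.5, 0.5], lam=2.0,
    n_kw=[1, 3], beta_kw=[1.0, 1.0], n_k_total=[6, 6], beta_k_total=[2.0, 2.0],
)
```

A sampler would draw the new topic for position $(d, j)$ from `probs`, update the counts, and move on to the next position.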
===4 Experiments===
To illustrate the model, we conducted two experiments in which metadata is used for different purposes. We first use mDTM on Twitter data for topic analysis, in which we take hashtags as the metadata. In the second experiment, we fit our model on the NIPS paper corpus and try to find information for specific authors, which we use as metadata. For conciseness, we mainly discuss the former in detail, because Twitter data is special and challenging for topic analysis. In the NIPS analysis, on top of results similar to those of previous dynamic models such as DTM, we can extract the evolution pattern of authors' interests, which is the main result we present for that experiment.

====4.1 Twitter Topic Analysis====

=====4.1.1 Data and Model Settings=====
The Twitter data in the experiment is from the paper of Yang and Leskovec [18]. We use the English tweets from July 1st, 2009 to August 31st, 2009. For each of the first three days, we randomly sampled 200,000 tweets from the dataset, and around 100,000 tweets were sampled for each of the remaining days. We considered the hashtags as the metadata in the experiment. After filtering stop words and ignoring all words appearing less than 10 times, a vocabulary of 12,000 words was selected by TF-IDF ranking. The number of topics was fixed at 50. In mDTM, the time-decay weighted average was used for $f_\Omega$ and $f_\beta$. We simply set $\kappa = 0.3$. Bayesian posterior evolution was used for hashtags, and the soft-thresholding discussed in Section 3.3.3 was used for the evolution of $g(h_d, t)$. The parameters $\lambda$ and $\epsilon$ are tuned according to the prediction performance in the first week, which is discussed in Section 4.1.4.

Our main interest is how topic popularity and contents change over time.

Figure 2: Topic Popularity on Twitter given by mDTM, over the period of July and August 2009.
=====4.1.2 Topic Popularity Evolution=====
As can be seen from Equation (2), all the documents with different metadata share the common term $m$, thus $m$ can be interpreted as the community popularity of topics, separated from the specific preference of metadata. This shows which topics are more popular on Twitter. Figure 2 gives the popularity over the two months of some topics, which we labeled manually after checking the word distributions of the topics.

=====4.1.3 Topic Contents Evolution=====
Since each topic is represented by a multinomial distribution, one could find out the important words of the topics. Table 1 gives the content evolution of the topic US politics. It can be seen that obama and tcot<sup>2</sup> are very important words. However, words about "Sarah Palin" were mainly popular in July, while the words about "Kennedy" and "Glenn Beck" became popular only at the end of August, all of which roughly match the pattern of search frequencies given by Google Trends.<sup>3</sup>

<sup>2</sup> The word tcot represents "top conservatives on twitter".

<sup>3</sup> We don't provide the results from Google Trends due to the limited space. The search frequencies can be found at www.google.com/trends/

{| class="wikitable"
|+ Table 1: Content evolution of the topic US politics
! Jul 4 !! Jul 27 !! Aug 12 !! Aug 30
|-
| palin || obama || health || kennedy
|-
| obama || palin || care || care
|-
| sarah || tcot || obama || ted
|-
| tcot || sarah || tcot || health
|-
| president || president || bill || obama
|-
| alaska || healthcare || healthcare || bill
|-
| al || health || reform || beck
|-
| honduras || obamas || insurance || glenn
|-
| governor || speech || president || public
|-
| palins || alaska || town || president
|}

=====4.1.4 Generality Performance=====
There is no standard method to evaluate dynamic topic models, thus we take a similar approach as in [3] to show the prediction performance on held-out data. On each day, we treat the next day's data as the held-out data and measure the prediction power of the model.

We compare mDTM with two LDA models without metadata as in [16] to illustrate the improvement provided by metadata modeling.<sup>4</sup> In the first model, we use LDA on the data of each day for inference, and call this model indLDA. The problem here is that there is no clear association for topics between days. In the second one, we try to overcome this drawback and take all the data of previous days for inference, which we call LDA-all. It would take nearly two months' data at the end of the period, which would be too much for computation. Thus we further subsampled the data from previous days for LDA-all at the end of the period to make it feasible. LDA-all does not serve our purpose, so the main interest is in comparing indLDA and mDTM. We report the negative log-likelihood on the held-out data computed as discussed by Wallach et al. [17] over the beginning period (July 4th - 10th) and the end period (Aug 21st - 30th). We estimate mDTM as discussed before, but compute the negative log-likelihood ignoring the metadata of the held-out data; this gives us an idea of how metadata can improve the modeling for general documents, even those without metadata. There is a $\lambda$ in all three models. We tune it and the thresholding parameter by achieving the best log-likelihood in the first week. Figures 3 and 4 illustrate the results. As is shown, mDTM always performs better than the other two models. This is not surprising because mDTM has more flexible priors. It is interesting that LDA-all performs even worse than indLDA. This is different from the results of [3]. It might be explained by the differences between Twitter data and scientific paper data. Twitter's topics change very frequently, but LDA-all takes all the previous days together, which undermines its power.

<sup>4</sup> We didn't compare directly with DTM. This is because DTM cannot be used in an online way, thus it cannot serve our purpose.

Figure 3: Negative log-likelihood during the early period (July 4th - 10th).

Figure 4: Negative log-likelihood during the end period (Aug 21st - 30th).

=====4.1.5 Effects of Metadata=====
In Twitter analysis, the topic preference of a specific hashtag is not of interest in itself. However, incorporating hashtags can improve the performance. On average, roughly 10 percent of the tweets have hashtags. But such a small proportion of metadata is able to provide an important improvement for the whole corpus, even for the tweets without hashtags. We compute the held-out log-likelihood for both the model inferred without using hashtags as metadata (called DTM-noTag) and the model mDTM using hashtags. DTM-noTag can be seen as TVUM with one user. Note that, as in Section 4.1.4, the held-out log-likelihood is computed ignoring the metadata of the held-out data. We take the improvement from hashtags to be the improvement of negative log-likelihood

$$(-\mathrm{loglik})_{\text{DTM-noTag}} - (-\mathrm{loglik})_{\text{mDTM}}.$$

Figure 5 illustrates the improvement of negative log-likelihood on the held-out data over the period. It can be seen that on average, incorporating hashtags as metadata does improve the performance, and this improvement tends to grow over time. This might result from the better estimation of most of the metadata preferences.

Figure 5: The improvement of negative log-likelihood via hashtags over the period. The red lines are the improvement of $-\log(\text{likelihood})$ computed by importance sampling. The blue lines are the intervals at each estimation point given by 2 standard deviations of the sampling.

=====4.1.6 Running Times=====
Here we present a timing comparison of mDTM and indLDA. Both were implemented in C++, running under Ubuntu 10.04, with a quad-core AMD Opteron processor and 64 GB RAM. We list average running times (rounded) in Table 2. indLDA is the average time over 10 days (July 4th - July 13th) with 600 sampling iterations each day. mDTM-1 is mDTM running on the same data with 600 sampling iterations. Since mDTM can inherit information from the previous time, we found that 300 iterations (or fewer) are enough for valid inference. Thus we use mDTM-2 to denote mDTM with 300 iterations. It can be seen that mDTM with 300 iterations (mDTM-2) is considerably faster than indLDA, while at an equal 600 iterations (mDTM-1) it is somewhat slower.

{| class="wikitable"
|+ Table 2: Running times of three different models
! indLDA !! mDTM-1 !! mDTM-2
|-
| 58min 41s || 67min 13s || 39min 24s
|}
=====4.1.7 Interpretability=====
The previous sections show that mDTM is better than indLDA and LDA-all at generality. However, the interpretability of the topics is also of interest. Chang et al. [5] revealed that models with better performance on held-out likelihood might have poor interpretability. Here we use the method in [5] to ask humans to evaluate the interpretability. We choose July 4th (the first day after the three initial days), July 11th (one week after July 4th), August 30th (the last day) and August 23rd (one week before the end) for experiments. However, news would be difficult for people to recognize after more than one year, so we only chose 10 stable topics from each model.<sup>5</sup> For every topic in each model, we construct the list by permuting the top 15 words for that topic together with 5 intruder words which have low probability in that topic but high probability in some other topics. Suppose we have $S$ subjects; then for each topic $k$, we compute the average correct ratio (ACR)

$$\mathrm{ACR}(k) = \sum_{s=1}^{S} C(s, k) / (5S),$$

where $C(s, k)$ is the number of correct intruders chosen by subject $s$ for topic $k$. We conducted a human evaluation experiment on Mechanical Turk with 150 subjects in total. Figure 6 shows the boxplot of the ACR distribution within each model on each day.

It can be seen that mDTM does not lose much interpretability despite its better prediction performance, which is different from the observations in [5]. We hypothesize that this is due to the impacts of metadata.

Figure 6: The human evaluation ACR for the three models. Each box is the distribution of average correct ratios for 10 topics of the corresponding model on a certain day.

<sup>5</sup> We count the number of different words in the top 20 words list on two consecutive days, and sum such numbers over the whole period. A larger sum means that the topic word list changes frequently. Then we select the 10 topics that are the most stable. The topics at different times are not associated for indLDA and LDA-all. We connect a pair of topics between two consecutive days if they have the most overlap in their top 20 words.
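The two quantities used in this evaluation, the average correct ratio and the stability score from footnote 5, are simple to compute. The sketch below is illustrative only: the function names are hypothetical, and "number of different words" is assumed here to mean words in one day's list that are absent from the previous day's list.

```python
def average_correct_ratio(correct_counts, intruders_per_list=5):
    """ACR(k) = sum_s C(s, k) / (5 S) for a single topic k.

    correct_counts[s] is C(s, k), the number of intruder words that
    subject s identified correctly for this topic.
    """
    num_subjects = len(correct_counts)
    return sum(correct_counts) / (intruders_per_list * num_subjects)


def topic_instability(top_words_by_day):
    """Sum, over consecutive days, of the words that are new in the later list.

    A larger value means the topic's top-word list changes more often;
    the most stable topics are those with the smallest values.
    """
    changes = 0
    for previous, current in zip(top_words_by_day, top_words_by_day[1:]):
        changes += len(set(current) - set(previous))
    return changes


# Toy data (illustrative): four subjects, and a three-word list over three days.
acr = average_correct_ratio([5, 3, 4, 0])
instability = topic_instability([["a", "b", "c"], ["a", "b", "d"], ["a", "e", "d"]])
```

With these toy inputs the subjects found 12 of the 20 possible intruders, and one word changed between each pair of consecutive days.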
====4.2 NIPS Topic Analysis====
In this section, we illustrate a different application of mDTM, that is, to extract specific information about metadata.

=====4.2.1 Data and Model Settings=====
The dataset for this experiment contains the text files of the NIPS conference from 1987 to 2003 in Globerson et al. [7].<sup>6</sup> We only use the text of the papers and take the authors as the metadata. The papers in 1987-1990 were used for the first time unit to initiate the model, and each year after that was taken as a new time unit. The preprocessing details of the data can be found on the website. We further deleted all the numbers and a few stop words. The resulting vocabulary has 10,005 words. The number of topics $K$ was set to 80. Bayesian posterior evolution was used for $g$ and $f_\beta$, and $f_\Omega$ was set as the time-decay weighted average with $\kappa = 0.3$. We don't use sparse preference in this example. The parameter $\lambda$ is again tuned by log-likelihood as before.

<sup>6</sup> Data can be found at http://ai.stanford.edu/~gal/data.html

=====4.2.2 Author-topic interests=====
As before, we could see the topic contents and popularity trends over time. Here, we only focus on the special information given by metadata in this experiment. When taking authors as metadata, an interesting result provided by mDTM is the interests of authors, similar to the results of [14]. Figure 7 shows the results given by mDTM for author "Jordan M". The height of the red bars represents the $\hat{\mu}_{k,h}$ from Equation (4) for $h$ = "Jordan M", which can be interpreted as the topic interests according to the past information.

Figure 7: Topic preference from mDTM of 80 topics, for author "Jordan M" in 1997, 1998 and 1999. (Three panels, "topic interests 1997", "topic interests 1998" and "topic interests 1999"; horizontal axis: Topics; vertical axis: interests.)

It can be seen that the author's favorite topics remained nearly the same during the three years, though the interest level for individual topics varied. When we know the topic interests of the author, we can further investigate the contents of the author's favorite topics, which is a way to detect the user's interests that would be useful in many applications. Table 3 shows the top 10 words for four topics of significant interest to "Jordan M" in 1999, according to the result in Figure 7. We can roughly see they are mainly about "clustering methods", "common descriptive terms", "graphical models" and "mixture models & density estimation", which is a reasonable approximation.

{| class="wikitable"
|+ Table 3: Four significant topics for "Jordan M" selected from Figure 7 in 1999.
! Topic 60 !! Topic 63 !! Topic 75 !! Topic 78
|-
| clustering || function || variational || model
|-
| clusters || number || nodes || data
|-
| information || figure || networks || models
|-
| data || results || inference || parameters
|-
| algorithm || set || gaussian || likelihood
|-
| cluster || data || graphical || mixture
|-
| feature || case || field || distribution
|-
| selection || based || conditional || log
|-
| risk || model || jordan || em
|-
| partition || problem || node || gaussian
|}

===5 Conclusion===
In this paper, we have developed a topic evolution model that incorporates metadata impacts. Flexible evolution patterns are proposed, which can be chosen according to the properties of the data and the applications. We also demonstrate the use of the model on Twitter data and NIPS data, revealing its advantages with respect to generality, computation and interpretability.

The work can be extended in many new ways. For the moment, it cannot model the birth and death of topics. One way to solve this problem is to use a general prior allocation mechanism such as HDP. There has been work using this idea for static models. In addition, the generality and flexibility of mDTM make it possible to build other evolution patterns for hyperparameters, which might be more suitable for specific modeling purposes.

===References===
[1] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 114–122, New York, NY, USA, 2011. ACM.

[2] C. E. Antoniak. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics, 2(6):1152–1174, 1974.

[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, ICML '06, pages 113–120, New York, NY, USA, 2006. ACM.

[4] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

[5] J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems, 2009.

[6] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. CVPR, pages 524–531, 2005.

[7] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean Embedding of Co-occurrence Data. The Journal of Machine Learning Research, 8:2265–2295, 2007.

[8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, Apr. 2004.

[9] Q. He, B. Chen, J. Pei, B. Qiu, P. Mitra, and L. Giles. Detecting topic evolution in scientific literature: how can citations help? In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM '09, pages 957–966, New York, NY, USA, 2009. ACM.

[10] T. Iwata, T. Yamada, Y. Sakurai, and N. Ueda. Online multiscale dynamic topic models. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '10, pages 663–672, New York, NY, USA, 2010. ACM.

[11] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2009.

[12] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In ICWSM, 2010.

[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, Singapore, August 2009. Association for Computational Linguistics.

[14] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, UAI '04, pages 487–494, Arlington, Virginia, United States, 2004. AUAI Press.

[15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[16] H. Wallach, D. Mimno, and A. McCallum. Rethinking lda: Why priors matter. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973–1981. 2009.

[17] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1105–1112, New York, NY, USA, 2009. ACM.

[18] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM '11, pages 177–186, New York, NY, USA, 2011. ACM.