=Paper=
{{Paper
|id=None
|storemode=property
|title=Topics Over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation
|pdfUrl=https://ceur-ws.org/Vol-962/paper09.pdf
|volume=Vol-962
|dblpUrl=https://dblp.org/rec/conf/uai/WalkerRS12
}}
==Topics Over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation==
Daniel D. Walker, Kevin Seppi, and Eric K. Ringger
Computer Science Department, Brigham Young University, Provo, UT 84604
danw@lkers.org, {kseppi, ringger}@cs.byu.edu

Abstract

We propose a new supervised topic model that uses a nonparametric density estimator to model the distribution of real-valued metadata given a topic. The model is similar to Topics Over Time, but replaces the beta distributions used in that model with a Dirichlet process mixture of normals. The use of a nonparametric density estimator allows for the fitting of a greater class of metadata densities. We compare our model with existing supervised topic models in terms of prediction and show that it is capable of discovering complex metadata distributions in both synthetic and real data.

1 Introduction

Supervised topic models are a class of topic models that, in addition to modeling documents as mixtures of topics, each with a distribution over words, also model metadata associated with each document. Document collections often include such metadata. For example, timestamps are commonly associated with documents and represent the time of the document's creation. In the case of online product reviews, "star" ratings frequently accompany written reviews to quantify the sentiment of the review's author.

There are three basic reasons that make supervised topic models attractive tools for use with document collections that include metadata. Better Topics: one assumption that is often true for document collections is that the topics being discussed are correlated with information that is not necessarily directly encoded in the text. Using the metadata in the inference of topics provides an extra source of information, which could lead to an improvement in modeling the topics that are found. Prediction: given a trained supervised topic model and a new document with missing metadata, one can predict the value of the metadata variable for that document. Even though timestamps are typically included in modern, natively digital documents, they may be unavailable or wrong for historical documents that have been digitized using OCR. Also, even relatively modern documents can have missing or incorrect timestamps due to user error or system misconfiguration. For example, in the full Enron e-mail corpus¹, there are 793 email messages with a timestamp before 1985, the year Enron was founded. Of these messages, 271 have a timestamp before the year 100. Analysis: in order to understand a document collection better, it is often helpful to understand how the metadata and topics are related. For example, one might want to analyze the development of a topic over time, or investigate what the presence of a particular topic means in terms of the sentiment being expressed by the author. One may, for example, plot the distribution of the metadata given a topic from a trained model.

Several supervised topic models can be found in the literature and will be discussed in more detail in Section 3. These models make assumptions about the way in which the metadata are distributed given the topic, or require the user to specify their own assumptions. Usually, this approach involves using a unimodal distribution, and the same distribution family is used to model the metadata across all topics. These modeling assumptions are problematic.

¹ http://www.cs.cmu.edu/~enron
First, it is easy to imagine metadata and topics that have complex, multi-modal relationships. For example, the U.S. has been involved in two large conflicts with Iraq over the last 20 years. A good topic model trained on news text for that period should ideally discover an Iraq topic and successfully capture the bimodal distribution of that topic in time. Existing supervised topic models, however, will either group both modes into a single mode, or split the two modes into two separate topics. Second, it seems incorrect to assume that the metadata will be distributed similarly across all topics. Some topics may remain fairly uniform over a long period of time, others appear quickly and then fade out over long periods of time (e.g., terrorism after 9/11), others enter the discourse gradually over time (e.g., healthcare reform), and still others appear and disappear in a relatively short period of time (e.g., many political scandals).

To address these issues, we introduce a new supervised topic model, Topics Over Nonparametric Time (TONPT), based on the Topics Over Time (TOT) model [12]. Where TOT uses a per-topic beta distribution to model topic-conditional metadata distributions, TONPT uses a nonparametric density estimator: a Dirichlet process mixture (DPM) of normals. The remainder of the paper is organized as follows: in Section 2 we provide a brief discussion of the Dirichlet process and show how a DPM of normals can be used to approximate a wide variety of densities. Section 3 outlines related work. In Section 4 we introduce the TONPT model and describe the collapsed Gibbs sampler we used to efficiently conduct inference in the model on a given dataset. Section 5 describes experiments that were run in order to compare TONPT with two other supervised topic models and a baseline. Finally, in Section 6 we summarize our results and contributions.

2 Estimating Densities with Dirichlet Process Mixtures

Significant work has been done in the document modeling community to make use of Dirichlet process mixtures with the goal of eliminating the need to specify the number of components in a mixture model. For example, it is possible to cluster documents without specifying a priori the number of clusters by replacing the Dirichlet-multinomial mixing distribution in the Mixture of Multinomials document model with a Chinese Restaurant Process. The CRP is the distribution over partitions created by the clustering effect of the Dirichlet process [1]. So, one way of using the Dirichlet process is in model-based clustering applications where it is desirable to let the number of clusters be determined dynamically by the data, instead of being specified by the user.

The DP is a distribution over probability measures G with two parameters: a base measure G0 and a total mass parameter m. Random probability measures drawn from a DP are generally not suitable as likelihoods for continuous random variates because they are discrete. This complication can be overcome by convolving G with a continuous kernel density f [9, 5, 6]:

G ∼ DP(m, G0)
xi | G ∼ ∫ f(xi | θ) dG(θ)

This model is equivalent to an infinite mixture of f distributions with the hierarchical formulation:

G ∼ DP(m, G0)
θi | G ∼ G
xi | θi ∼ f(xi | θi)

In our work we use the normal distribution for f. The normal distribution has many advantages that make it a useful choice here. First, its parameters map intuitively to the idea that the θ parameters in the DPM are the "locations" of the point masses of G and so are a natural fit for the mean parameter of the normal distribution. Second, because the normal is conjugate to the mean of a normal with known variance, we can also choose a conjugate G0 that has intuitive parameters and simple posterior and marginal forms. Third, the normal is almost trivially extensible to multivariate cases. Fourth, the normal can be centered anywhere on the positive or negative side of the origin, which is not true, for example, of the gamma and beta distributions. Finally, just as any 1D signal can be approximated with a sum of sine waves, almost any probability distribution can be approximated with a weighted sum of normal densities.
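As an illustration of this construction, the following minimal Python sketch (not from the paper; the parameter values m, µ0, σ0, and σ are hypothetical) draws observations from a DP mixture of normals using the Chinese Restaurant Process representation: the component means play the role of the point masses of G drawn from G0, and each observation is drawn from the normal kernel f centered at its component's mean.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sample_dpm_of_normals(n, m=1.0, mu0=0.0, sigma0=3.0, sigma=0.5):
    """Draw n observations from a DP mixture of normals via the CRP:
    an observation joins an existing component with probability
    proportional to that component's count, or opens a new component
    (whose mean is drawn from G0 = Normal(mu0, sigma0^2)) with
    probability proportional to the total mass m."""
    means, counts = [], []
    x = np.empty(n)
    for i in range(n):
        probs = np.array(counts + [m], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(means):                       # open a new component
            means.append(rng.normal(mu0, sigma0))
            counts.append(0)
        counts[k] += 1
        x[i] = rng.normal(means[k], sigma)        # convolve G with the normal kernel f
    return x, np.array(means), np.array(counts)

x, means, counts = sample_dpm_of_normals(1000)
print(len(means), "components with weights", counts / counts.sum())
</pre>

Because the number of occupied components is determined by the data rather than fixed in advance, the resulting weighted sum of normal densities can take on multi-modal or skewed shapes, which is the property TONPT exploits for its per-topic metadata distributions.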
Figure 1: The Supervised LDA model.
Figure 2: The Topics Over Time model.

3 Related Work

In this section we describe the three models that are most closely related to our work. In particular, we focus on the issues of prediction and the posterior analysis of metadata distributions in order to highlight the strengths and weaknesses of each model.

The models most closely related to TONPT are Supervised LDA (sLDA) [3] and Topics Over Time [12]. sLDA uses a generalized linear model (GLM) to regress the metadata given the topic proportions of each document. GLMs are flexible in that they allow for the specification of a link and a dispersion function that can change the behavior of the regression model. In practice, however, making such a change to the model requires non-trivial modifications to the inference procedure used to learn the topics and regression coefficients. In the original sLDA paper, an identity link function and normal dispersion distribution were used. The model, shown in Figure 1, has per-document timestamp variables td ∼ Normal(c · zd, σ²), where c is the vector of linear model coefficients and zd is a topic proportion vector for document d (see Table 1 for a description of the other variables in the models shown here). This configuration leads to a stochastic EM inference procedure in which one alternately samples from the complete conditional for each topic assignment, given the current values of all the other variables, and then finds the regression coefficients that minimize the sum of squared residuals of the linear prediction model. Variations of sLDA have been used successfully in several applications, including modeling the voting patterns of U.S. legislators [7] and links between documents [4].

Prediction in sLDA is very straightforward, as the latent metadata variable for a document can be marginalized out to produce a vanilla LDA complete conditional distribution for the topic assignments. The procedure for prediction can thus be as simple as first sampling the topic assignments for each word in an unseen document given the assignments in the training set, and then taking the dot product between the estimated topic proportions for the document and the GLM coefficients. In terms of the representation of the distribution of metadata given topics, however, the model is somewhat lacking. The coefficients learned during inference convey only one-dimensional information about the correlation between topics and the metadata. A large positive coefficient for a given topic indicates that documents with a higher proportion of that topic tend to have higher metadata values, and a large negative coefficient means that documents with a higher proportion of that topic tend to have lower metadata values. Coefficients close to zero indicate low correlation between the topic and the metadata.
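As a concrete illustration of that prediction step, here is a minimal Python sketch (not the authors' implementation; the topic proportions and coefficients below are hypothetical) for the identity-link case described above: the predicted metadata value for a held-out document is the dot product of its estimated topic proportions with the learned GLM coefficients.

<pre>
import numpy as np

def predict_slda_metadata(theta_hat, c):
    """sLDA prediction with an identity link: dot product of the estimated
    topic-proportion vector for a held-out document with the per-topic
    GLM regression coefficients c."""
    return float(np.dot(theta_hat, c))

theta_hat = np.array([0.10, 0.60, 0.25, 0.05])  # estimated topic proportions (hypothetical)
c = np.array([-2.0, 1.5, 0.3, 0.0])             # learned regression coefficients (hypothetical)
print(predict_slda_metadata(theta_hat, c))       # 0.775
</pre>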
In TOT, metadata are treated as per-word observations, instead of as a single per-document observation. The model, shown in Figure 2, assumes that each per-word metadatum tdi is drawn from a per-topic beta distribution: tdi ∼ Beta(ψzdi,1, ψzdi,2). The inference procedure for TOT is a stochastic EM algorithm, where the topic assignments for each word are first sampled with a collapsed Gibbs sampler and then the shape parameters for the per-topic beta distributions are point-estimated using the Method of Moments, based on the mean and variance of the metadata values for the words assigned to each topic.
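That Method of Moments step has a simple closed form. The sketch below (illustrative only, not the authors' code; the sample values are hypothetical and the metadata are assumed to have been rescaled into (0, 1)) matches the beta shape parameters to the sample mean and variance of the metadata values currently assigned to one topic.

<pre>
import numpy as np

def beta_method_of_moments(t):
    """Method-of-Moments point estimates of Beta(psi1, psi2) shape
    parameters from the sample mean and variance of metadata values t,
    assumed to lie in (0, 1) with 0 < var < mean * (1 - mean)."""
    mean, var = np.mean(t), np.var(t)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

# Hypothetical rescaled timestamps for the words assigned to one topic:
t_topic = np.array([0.12, 0.15, 0.18, 0.22, 0.25, 0.30])
print(beta_method_of_moments(t_topic))
</pre>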
Prediction in TOT is not as straightforward as for sLDA. Like sLDA, it is possible to integrate out the random variables directly related to the metadata and estimate a topic distribution for a held-out document using vanilla LDA inference. However, because the model does not include a document-level metadata variable, there is no obvious way to predict a single metadata value for held-out documents. We describe a prediction procedure in Section 5, based on work done by Wang and McCallum, that yields acceptable results in practice.

Despite having a more complicated prediction procedure, TOT yields a much richer picture of the trends present in the data. It is possible with TOT, for example, not only to get an idea of whether the metadata are correlated with a topic, but also to see the mean and variance of the per-topic metadata distributions and even to show whether the distribution is skewed or symmetric.

Another related model is the Dirichlet Multinomial Regression (DMR) model [11]. Whereas the sLDA and TOT models both model the metadata generatively, i.e., as random variables conditioned on the topic assignments for a document, the DMR forgoes modeling the metadata explicitly, putting the metadata variables at the "root" of the graphical model and conditioning the document distributions over topics on the metadata values. By forgoing a direct modeling of the metadata, the DMR is able to take advantage of a wide range of metadata types and even to include multiple metadata measurements (or "features") per document. The authors show how, conditioning on the metadata, the DMR is able to outperform other supervised topic models in terms of its ability to fit the observed words of held-out documents, yielding lower perplexity values. The DMR is thus able to accomplish one of the goals of supervised topic modeling very well (the increase in topic quality). However, because it does not propose any distribution over metadata values, it is difficult to conduct the types of analyses or missing-metadata predictions possible in TOT and sLDA without resorting to ad-hoc procedures. Because of these deficiencies, we leave the DMR out of the remaining discussion of supervised topic models.

4 Topics Over Nonparametric Time

TONPT models metadata variables associated with each word in the corpus as being drawn from a topic-specific Dirichlet process mixture of normals. In addition, TONPT employs a common base measure G0 for all of the per-topic DPMs, for which we use a normal with mean µ0 and variance σ0². The random variables are distributed as follows:

θd | α ∼ Dirichlet(α)
φt | β ∼ Dirichlet(β)
zdi | θ ∼ Categorical(θd)
wdi | zdi, φ ∼ Categorical(φzdi)
σj² | ασ, βσ ∼ InverseGamma(ασ, βσ)
Gj | G0, m ∼ DP(G0, m)
tdi | Gzdi, σzdi² ∼ ∫ f(tdi; γ, σzdi²) dGzdi(γ)

where f(·; γ, σ²) is the normal p.d.f. with mean γ and variance σ². Also, j ∈ {1, ..., T}, d ∈ {1, ..., D}, and, given a value for d, i ∈ {1, ..., Nd}. We note that, as in TOT, the fact that the metadata variable is repeated per word leads to a deficient generative model, because the metadata are typically observed at a document level and the assumed constraint that all of the metadata values for the words in a document be equivalent is not modeled. The advantage of this approach is that this configuration simplifies inference, and also naturally balances the plurality of the word variables with the singularity of the metadata variable, allowing the metadata to exert a similarly scaled influence on the topic assignments during inference. In addition, this modeling choice allows for a more fine-grained labeling of documents (e.g., at the word, phrase, or paragraph level) and for finer-grained prediction. For example, while timestamps should probably be the same for all words in a document, sentiment does not need to meet this constraint: there are often positive comments even in very negative reviews.
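To make this generative story concrete, the sketch below is a simplified forward simulation in Python under assumed parameter values (it is not the authors' implementation; the inference procedure discussed next is what is actually used in practice): each token receives a topic from θd, a word type from that topic's φ, and a metadata value from the topic's DP mixture of normals, with the mixture component chosen by a CRP step.

<pre>
import numpy as np

rng = np.random.default_rng(1)

def generate_document(Nd, theta, phi, dpm_state, sigma2, m=1.0, mu0=0.0, sigma0=3.0):
    """Forward-simulate one TONPT document. dpm_state[j] = (means, counts)
    holds the component means gamma_jk drawn so far from G0 and their
    occupancy counts for topic j; it is updated in place."""
    words, topics, metadata = [], [], []
    for _ in range(Nd):
        z = int(rng.choice(len(theta), p=theta))            # z_di ~ Categorical(theta_d)
        words.append(int(rng.choice(phi.shape[1], p=phi[z])))  # w_di ~ Categorical(phi_z)
        means, counts = dpm_state[z]
        probs = np.array(counts + [m], dtype=float)         # CRP over existing components + a new one
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)                  # component indicator s_di
        if k == len(means):                                  # new component: gamma ~ G0 = Normal(mu0, sigma0^2)
            means.append(rng.normal(mu0, sigma0))
            counts.append(0)
        counts[k] += 1
        metadata.append(rng.normal(means[k], np.sqrt(sigma2[z])))  # t_di ~ Normal(gamma_{z,k}, sigma_z^2)
        topics.append(z)
    return words, topics, metadata

T, V = 3, 5
theta = np.array([0.5, 0.3, 0.2])               # hypothetical document-topic proportions
phi = rng.dirichlet(np.ones(V), size=T)         # hypothetical topic-word distributions
dpm_state = {j: ([], []) for j in range(T)}     # empty per-topic DP mixtures
sigma2 = np.full(T, 0.25)                       # per-topic kernel variances
print(generate_document(8, theta, phi, dpm_state, sigma2))
</pre>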
This model does not lend itself well to inference and sampling because of the integral in the distribution over tdi. A typical modification made to facilitate sampling in mixture models is to use an equivalent hierarchical model. Another modification that is typically made when sampling in mixture models is to separate the "clustering," or mixing, portion of the distribution from the prior over mixture component parameters. The mixing distribution in a DPM is the distribution known as the Chinese Restaurant Process. The Chinese Restaurant Process is used to select, for each data observation drawn from G, an assignment to one of the points that make up the DP point process. The locations of these points are independently drawn from G0.

Table 1: Symbols used in the models.
Common supervised topic modeling variables:
α: Prior parameter for document-topic distributions
θd: Parameter for the topic mixing distribution for document d
β: Prior parameter for the topic-word distributions
φj: Parameter for the jth topic-word distribution
zdi: Topic label for word i in document d
z−di: All topic assignments except that for zdi
w: Vector of all word token types
wdi: Type of word token i in document d
tdi: Timestamp for word i in document d
td: Timestamp for document d
t: Vector of all metadata variable values
t̂: A predicted value for the metadata variable
D: The number of documents
T: The number of topics
V: The number of word types
Nd: The number of tokens in document d
TONPT-specific variables:
m: Total mass parameter for the DP mixtures
sdi: DP component membership for word i in document d
s−di: All DP component assignments except that for sdi
G0: The base measure of the DP mixtures
µ0: The mean of the base measure
σ0²: The variance of the base measure
γjk: The mean of the kth mixture component for topic j
γ: A vector of all the γ values
γ−jk: γ without γjk
σj²: The variance of the components of the jth DP mixture
σ²: A vector of all the DPM σ²s
ασ, βσ: Shape and scale parameters for the prior on the topic σ²s
Kj: The number of unique observed γs for topic j
nj: The number of tokens assigned to topic j
njk: The number of tokens assigned to the kth component of topic j
ndj: The number of tokens in document d assigned to topic j
njv: The number of times a token of type v was assigned to topic j

Figure 3: TONPT as used in sampling.

Figure 3 shows the model that results from decomposing the Dirichlet process into these two component pieces. The Kj unique γ values that have been sampled so far for each topic j are drawn from G0. The sdi variables are indicator variables that take on values in 1, ..., Kj and represent which of the DPM components each tdi was drawn from. This model has the following changes to the variable distributions:
= k with prob ∝ nzdi,k