 Topics Over Nonparametric Time: A Supervised Topic Model Using Bayesian
                    Nonparametric Density Estimation



                             Daniel D. Walker, Kevin Seppi, and Eric K. Ringger
                                        Computer Science Department
                                         Brigham Young University
                                             Provo, UT, 84604
                               danw@lkers.org, {kseppi, ringger}@cs.byu.edu


                      Abstract

We propose a new supervised topic model that uses a nonparametric density estimator to model the distribution of real-valued metadata given a topic. The model is similar to Topics Over Time, but replaces the beta distributions used in that model with a Dirichlet process mixture of normals. The use of a nonparametric density estimator allows for the fitting of a greater class of metadata densities. We compare our model with existing supervised topic models in terms of prediction and show that it is capable of discovering complex metadata distributions in both synthetic and real data.

1   Introduction

Supervised topic models are a class of topic models that, in addition to modeling documents as mixtures of topics, each with a distribution over words, also model metadata associated with each document. Document collections often include such metadata. For example, timestamps that represent the time of a document's creation are commonly associated with documents. In the case of online product reviews, "star" ratings frequently accompany written reviews to quantify the sentiment of the review's author.

There are three basic reasons that make supervised topic models attractive tools for use with document collections that include metadata. Better Topics: one assumption that is often true for document collections is that the topics being discussed are correlated with information that is not necessarily directly encoded in the text. Using the metadata in the inference of topics provides an extra source of information, which can lead to an improvement in modeling the topics that are found. Prediction: given a trained supervised topic model and a new document with missing metadata, one can predict the value of the metadata variable for that document. Even though timestamps are typically included in modern, natively digital documents, they may be unavailable or wrong for historical documents that have been digitized using OCR. Also, even relatively modern documents can have missing or incorrect timestamps due to user error or system misconfiguration. For example, the full Enron email corpus¹ contains 793 email messages with a timestamp before 1985, the year Enron was founded; 271 of these have a timestamp before the year 100. Analysis: in order to understand a document collection better, it is often helpful to understand how the metadata and topics are related. For example, one might want to analyze the development of a topic over time, or investigate what the presence of a particular topic means in terms of the sentiment being expressed by the author. One may, for example, plot the distribution of the metadata given a topic from a trained model.

Several supervised topic models can be found in the literature and will be discussed in more detail in Section 3. These models either make assumptions about the way in which the metadata are distributed given the topic or require the user to specify their own assumptions. Usually, this means a unimodal distribution, with the same distribution family used to model the metadata across all topics.

¹ http://www.cs.cmu.edu/~enron
These modeling assumptions are problematic. First, it is easy to imagine metadata and topics that have complex, multi-modal relationships. For example, the U.S. has been involved in two large conflicts with Iraq over the last 20 years. A good topic model trained on news text from that period should ideally discover an Iraq topic and successfully capture the bimodal distribution of that topic in time. Existing supervised topic models, however, will either group both modes into a single mode or split the two modes into two separate topics. Second, it seems incorrect to assume that the metadata will be distributed similarly across all topics. Some topics remain fairly uniform over a long period of time, others appear quickly and then fade out over long periods of time (e.g., terrorism after 9/11), others enter the discourse gradually over time (e.g., healthcare reform), and still others appear and disappear in a relatively short period of time (e.g., many political scandals).

To address these issues, we introduce a new supervised topic model, Topics Over Nonparametric Time (TONPT), based on the Topics Over Time (TOT) model [12]. Where TOT uses a per-topic beta distribution to model topic-conditional metadata distributions, TONPT uses a nonparametric density estimator, a Dirichlet process mixture (DPM) of normals.

The remainder of the paper is organized as follows: in Section 2 we provide a brief discussion of the Dirichlet process and show how a DPM of normals can be used to approximate a wide variety of densities. Section 3 outlines related work. In Section 4 we introduce the TONPT model and describe the collapsed Gibbs sampler we use to conduct inference in the model efficiently. Section 5 describes experiments that compare TONPT with two other supervised topic models and a baseline. Finally, in Section 6 we summarize our results and contributions.

2   Estimating Densities with Dirichlet Process Mixtures

Significant work has been done in the document modeling community to make use of Dirichlet process mixtures, with the goal of eliminating the need to specify the number of components in a mixture model. For example, it is possible to cluster documents without specifying the number of clusters a priori by replacing the Dirichlet-multinomial mixing distribution in the Mixture of Multinomials document model with a Chinese Restaurant Process. The CRP is the distribution over partitions created by the clustering effect of the Dirichlet process [1]. One way of using the Dirichlet process, then, is in model-based clustering applications where it is desirable to let the number of clusters be determined dynamically by the data instead of being specified by the user.

The DP is a distribution over probability measures G with two parameters: a base measure G0 and a total mass parameter m. Random probability measures drawn from a DP are generally not suitable as likelihoods for continuous random variates because they are discrete. This complication can be overcome by convolving G with a continuous kernel density f [9, 5, 6]:

        G ∼ DP(m, G0)
    xi | G ∼ ∫ f(xi | θ) dG(θ)

This model is equivalent to an infinite mixture of f distributions with the hierarchical formulation:

        G ∼ DP(m, G0)
    θi | G ∼ G
    xi | θi ∼ f(xi | θi)

In our work we use the normal distribution for f. The normal distribution has many advantages that make it a useful choice here. First, its parameters map intuitively onto the model: the θ parameters in the DPM are the "locations" of the point masses of G and so are a natural fit for the mean parameter of the normal distribution. Second, because the normal is conjugate to the mean of a normal with known variance, we can choose a conjugate G0 that has intuitive parameters and simple posterior and marginal forms. Third, the normal is almost trivially extensible to multivariate cases. Fourth, the normal can be centered anywhere on the positive or negative side of the origin, which is not true, for example, of the gamma and beta distributions. Finally, just as any 1D signal can be approximated with a sum of sine waves, almost any probability distribution can be approximated with a weighted sum of normal densities.
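To make this property concrete, the following sketch draws a random density from a DPM of normals and evaluates it on a grid. It is a minimal illustration using a truncated stick-breaking construction; NumPy, the truncation level, and all hyperparameter values here are illustrative choices, not part of any model in this paper:

    import numpy as np

    def sample_dpm_density(m=1.0, mu0=0.0, sigma0=2.0, sigma=0.25,
                           trunc=200, grid=None, rng=None):
        # Approximate G ~ DP(m, G0) by truncated stick breaking:
        # v_k ~ Beta(1, m), w_k = v_k * prod_{j<k} (1 - v_j).
        rng = np.random.default_rng() if rng is None else rng
        grid = np.linspace(-6.0, 6.0, 500) if grid is None else grid
        v = rng.beta(1.0, m, size=trunc)
        w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        w /= w.sum()  # renormalize to absorb truncation error
        # Point-mass locations are i.i.d. draws from G0 = Normal(mu0, sigma0^2).
        locs = rng.normal(mu0, sigma0, size=trunc)
        # Convolve the discrete measure with the normal kernel f.
        dens = np.zeros_like(grid)
        for w_k, loc_k in zip(w, locs):
            dens += w_k * np.exp(-0.5 * ((grid - loc_k) / sigma) ** 2) \
                    / (sigma * np.sqrt(2.0 * np.pi))
        return grid, dens

Plotting a few repeated draws shows why this prior is attractive for metadata: individual draws may be unimodal, multi-modal, or skewed, so per-topic densities are not forced into a single parametric family.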
[Figure 1 (graphical model): The Supervised LDA model.]

[Figure 2 (graphical model): The Topics Over Time model.]

3   Related Work

In this section we describe the three models most closely related to our work. In particular, we focus on the issues of prediction and the posterior analysis of metadata distributions in order to highlight the strengths and weaknesses of each model.

The models most closely related to TONPT are Supervised LDA (sLDA) [3] and Topics Over Time [12]. sLDA uses a generalized linear model (GLM) to regress the metadata on the topic proportions of each document. GLMs are flexible in that they allow for the specification of a link function and a dispersion function that change the behavior of the regression model. In practice, however, making such a change to the model requires non-trivial modifications to the inference procedure used to learn the topics and regression coefficients. In the original sLDA paper, an identity link function and a normal dispersion distribution were used. The model, shown in Figure 1, has per-document timestamp variables td ∼ Normal(c · zd, σ²), where c is the vector of linear model coefficients and zd is the topic proportion vector for document d (see Table 1 for a description of the other variables in the models shown here). This configuration leads to a stochastic EM inference procedure in which one alternately samples from the complete conditional for each topic assignment, given the current values of all the other variables, and then finds the regression coefficients that minimize the sum of squared residuals of the linear prediction model. Variations of sLDA have been used successfully in several applications, including modeling the voting patterns of U.S. legislators [7] and links between documents [4].

Prediction in sLDA is very straightforward, as the latent metadata variable for a document can be marginalized out to produce a vanilla LDA complete conditional distribution for the topic assignments. The procedure for prediction can thus be as simple as first sampling the topic assignments for each word in an unseen document given the assignments in the training set, and then taking the dot product of the estimated topic proportions for the document and the GLM coefficients. In terms of representing the distribution of metadata given topics, however, the model is somewhat lacking. The coefficients learned during inference convey only one-dimensional information about the correlation between topics and the metadata. A large positive coefficient for a given topic indicates that documents with a higher proportion of that topic tend to have higher metadata values, and a large negative coefficient means that documents with a higher proportion of that topic tend to have lower metadata values. Coefficients close to zero indicate low correlation between the topic and the metadata.
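A minimal sketch of this prediction procedure (the helper names are hypothetical; gibbs_sample_topics stands in for whatever trained vanilla LDA sampler is available):

    import numpy as np

    def slda_predict(doc, coeffs, alpha, gibbs_sample_topics, n_samples=100):
        # coeffs: GLM coefficient vector c, one entry per topic.
        # gibbs_sample_topics(doc) -> one sampled topic-assignment vector z
        # for the held-out document, conditioned on the trained model.
        T = len(coeffs)
        counts = np.zeros(T)
        for _ in range(n_samples):
            z = gibbs_sample_topics(doc)
            counts += np.bincount(z, minlength=T)
        # Smoothed estimate of the document's topic proportions.
        z_bar = (counts + alpha) / (counts.sum() + T * alpha)
        # The prediction is the dot product of c and the topic proportions.
        return float(coeffs @ z_bar)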
In TOT, metadata are treated as per-word observations instead of as a single per-document observation. The model, shown in Figure 2, assumes that each per-word metadatum tdi is drawn from a per-topic beta distribution: tdi ∼ Beta(ψzdi,1, ψzdi,2). The inference procedure for TOT is a stochastic EM algorithm: the topic assignments for each word are first sampled with a collapsed Gibbs sampler, and then the shape parameters of the per-topic beta distributions are point estimated using the Method of Moments, based on the mean and variance of the metadata values for the words assigned to each topic.
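The Method of Moments step has a closed form. Writing x̄ and s² for the sample mean and variance of the metadata values assigned to a topic (with values scaled into (0, 1), as TOT does for timestamps), the estimates are ψ1 = x̄(x̄(1−x̄)/s² − 1) and ψ2 = (1−x̄)(x̄(1−x̄)/s² − 1). A minimal sketch:

    import numpy as np

    def beta_mom(t):
        # Method-of-moments estimates of Beta(psi_1, psi_2) from
        # the metadata values t in (0, 1) assigned to one topic.
        mean, var = np.mean(t), np.var(t)
        # Valid only when var < mean * (1 - mean), the largest variance
        # a beta distribution with this mean can have.
        common = mean * (1.0 - mean) / var - 1.0
        return mean * common, (1.0 - mean) * common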
Prediction in TOT is not as straightforward as for sLDA. As in sLDA, it is possible to integrate out the random variables directly related to the metadata and estimate a topic distribution for a held-out document using vanilla LDA inference. However, because the model does not include a document-level metadata variable, there is no obvious way to predict a single metadata value for held-out documents. We describe a prediction procedure in Section 5, based on work by Wang and McCallum, that yields acceptable results in practice.
Despite having a more complicated prediction procedure, TOT yields a much richer picture of the trends present in the data. With TOT it is possible, for example, to see not only whether the metadata are correlated with a topic, but also the mean and variance of the per-topic metadata distributions, and even whether a distribution is skewed or symmetric.
Another related model is the Dirichlet Multinomial Regression (DMR) model [11]. Whereas the sLDA and TOT models both model the metadata generatively, i.e., as random variables conditioned on the topic assignments for a document, the DMR forgoes modeling the metadata explicitly, instead putting the metadata variables at the "root" of the graphical model and conditioning the document distributions over topics on the metadata values. By forgoing a direct modeling of the metadata, the DMR is able to take advantage of a wide range of metadata types and even to include multiple metadata measurements (or "features") per document. The authors show that, by conditioning on the metadata, the DMR is able to outperform other supervised topic models in terms of its ability to fit the observed words of held-out documents, yielding lower perplexity values. The DMR thus accomplishes one of the goals of supervised topic modeling, the increase in topic quality, very well. However, because it does not propose any distribution over metadata values, it is difficult to conduct the types of analyses or missing-metadata predictions possible in TOT and sLDA without resorting to ad-hoc procedures. Because of these deficiencies, we leave the DMR out of the remaining discussion of supervised topic models.

[Figure 3 (graphical model): TONPT as used in sampling.]

4   Topics Over Nonparametric Time

TONPT models the metadata variables associated with each word in the corpus as being drawn from a topic-specific Dirichlet process mixture of normals. In addition, TONPT employs a common base measure G0 for all of the per-topic DPMs, for which we use a normal with mean µ0 and variance σ0².

The random variables are distributed as follows:

              θd | α ∼ Dirichlet(α)
              φj | β ∼ Dirichlet(β)
             zdi | θd ∼ Categorical(θd)
         wdi | zdi, φ ∼ Categorical(φzdi)
         σj² | ασ, βσ ∼ InverseGamma(ασ, βσ)
          Gj | G0, m ∼ DP(m, G0)
    tdi | Gzdi, σzdi² ∼ ∫ f(tdi; γ, σzdi²) dGzdi(γ)

where f(·; γ, σ²) is the normal p.d.f. with mean γ and variance σ². Also, j ∈ {1, . . . , T}, d ∈ {1, . . . , D}, and, given a value for d, i ∈ {1, . . . , Nd}. We note that, as in TOT, the fact that the metadata variable is repeated per word leads to a deficient generative model: the metadata are typically observed at the document level, and the constraint that all of the metadata values for the words in a document be equal is not modeled.
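To make the generative story concrete, here is a minimal simulation sketch that follows the distributions above (an illustration under stated assumptions, not the paper's inference code; the corpus sizes and hyperparameter values are arbitrary). The per-topic component means γjk are drawn lazily from G0 as the Chinese Restaurant Process, discussed further below, opens new components:

    import numpy as np

    def generate_corpus(D=100, Nd=50, T=5, V=1000, alpha=0.1, beta=0.01,
                        m=1.0, mu0=0.0, sigma0=1.0, a_sig=2.0, b_sig=1.0,
                        rng=None):
        # Simulate documents, words, and per-word metadata t_di from TONPT.
        rng = np.random.default_rng() if rng is None else rng
        phi = rng.dirichlet(np.full(V, beta), size=T)        # phi_j
        sig2 = 1.0 / rng.gamma(a_sig, 1.0 / b_sig, size=T)   # sigma_j^2 ~ InvGamma
        locs = [[] for _ in range(T)]                        # gamma_jk, drawn from G0
        counts = [[] for _ in range(T)]                      # CRP component counts n_jk
        docs = []
        for _ in range(D):
            theta = rng.dirichlet(np.full(T, alpha))         # theta_d
            words, meta = [], []
            for _ in range(Nd):
                j = rng.choice(T, p=theta)                   # z_di
                words.append(rng.choice(V, p=phi[j]))        # w_di
                # s_di ~ CRP(m): existing component k with prob prop. to n_jk,
                # a new component with prob prop. to m.
                weights = np.array(counts[j] + [m], dtype=float)
                k = rng.choice(len(weights), p=weights / weights.sum())
                if k == len(locs[j]):
                    locs[j].append(rng.normal(mu0, sigma0))  # new gamma_jk ~ G0
                    counts[j].append(0)
                counts[j][k] += 1
                meta.append(rng.normal(locs[j][k], np.sqrt(sig2[j])))  # t_di
            docs.append((words, meta))
        return docs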
The advantage of this approach is that it simplifies inference, and it naturally balances the plurality of the word variables against the singularity of the metadata variable, allowing the metadata to exert a similarly scaled influence on the topic assignments during inference. In addition, this modeling choice allows for a more fine-grained labeling of documents (e.g., at the word, phrase, or paragraph level) and for finer-grained prediction. For example, while timestamps should probably be the same for all words in a document, sentiment need not meet this constraint: there are often positive comments even in very negative reviews.

Table 1: Symbols used in the models.

    Symbol     Meaning

    Common Supervised Topic Modeling Variables
    α          Prior parameter for document-topic distributions
    θd         Parameter for the topic mixing distribution for document d
    β          Prior parameter for the topic-word distributions
    φj         Parameter for the jth topic-word distribution
    zdi        Topic label for word i in document d
    z−di       All topic assignments except that for zdi
    w          Vector of all word token types
    wdi        Type of word token i in document d
    tdi        Timestamp for word i in document d
    td         Timestamp for document d
    t          Vector of all metadata variable values
    t̂          A predicted value for the metadata variable
    D          The number of documents
    T          The number of topics
    V          The number of word types
    Nd         The number of tokens in document d

    TONPT-Specific Variables
    m          Total mass parameter for the DP mixtures
    sdi        DP component membership for word i in document d
    s−di       All DP component assignments except that for sdi
    G0         The base measure of the DP mixtures
    µ0         The mean of the base measure
    σ0²        The variance of the base measure
    γjk        The mean of the kth mixture component for topic j
    γ          A vector of all the γ values
    γ−jk       γ without γjk
    σj²        The variance of the components of the jth DP mixture
    σ²         A vector of all the per-topic DPM variances
    ασ, βσ     Shape and scale parameters for the prior on the topic variances
    Kj         The number of unique observed γs for topic j
    nj         The number of tokens assigned to topic j
    njk        The number of tokens assigned to the kth component of topic j
    ndj        The number of tokens in document d assigned to topic j
    njv        The number of times a token of type v was assigned to topic j

This model does not lend itself well to inference and sampling because of the integral in the distribution over tdi. One modification typically made to facilitate sampling in mixture models is to use an equivalent hierarchical model. Another is to separate the "clustering," or mixing, portion of the distribution from the prior over the mixture component parameters. The mixing distribution in a DPM is the Chinese Restaurant Process, which selects, for each data observation drawn from G, an assignment to one of the points that make up the DP point process; the locations of these points are drawn independently from G0.

Figure 3 shows the model that results from decomposing the Dirichlet process into these two component pieces. The Kj unique γ values that have been sampled so far for each topic j are drawn from G0. The sdi variables are indicator variables that take on values in {1, . . . , Kj} and represent which of the DPM components each tdi was drawn from. This model has the following changes to the variable distributions:

    sdi | zdi, s−di  = k with prob ∝ nzdi,k,    k ∈ {1, . . . , Kzdi}
                     = Kzdi + 1 with prob ∝ m
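As a sketch of the kind of update this conditional defines, the following implements a standard Gibbs draw for a single sdi in a conjugate DPM with instantiated component means (a reconstruction under that assumption, not the paper's code): an existing component k is chosen with probability proportional to nzdi,k times the normal likelihood of tdi, and a new component with probability proportional to m times the marginal likelihood of tdi under G0.

    import numpy as np

    def sample_s(t, gamma_j, n_j, m, mu0, sigma0_sq, sigma_j_sq, rng):
        # One Gibbs draw of s_di for a word with metadatum t in topic j.
        # gamma_j: current component means for the topic (length K_j)
        # n_j:     counts n_jk with the current word removed
        # Returns a component index; index K_j means "open a new component".
        def norm_pdf(x, mu, var):
            return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

        # Existing components: prob prop. to n_jk * Normal(t; gamma_jk, sigma_j^2).
        p_old = np.asarray(n_j, dtype=float) * norm_pdf(t, np.asarray(gamma_j),
                                                        sigma_j_sq)
        # New component: prob prop. to m * integral of Normal(t; g, sigma_j^2)
        # dG0(g) = m * Normal(t; mu0, sigma0^2 + sigma_j^2) by conjugacy.
        p_new = m * norm_pdf(t, mu0, sigma0_sq + sigma_j_sq)
        p = np.append(p_old, p_new)
        return rng.choice(len(p), p=p / p.sum())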