<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autoencoding variational Bayes for latent Dirichlet allocation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zach Wolpe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alta de Waal</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Center for Artificial Intelligence Research</institution>
          ,
          <addr-line>CAIR</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistics, University of Pretoria</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Many posterior distributions take intractable forms and thus require approximate inference where analytical solutions cannot be found. Variational inference (VI) and Markov chain Monte Carlo (MCMC) are established mechanisms for approximating these intractable quantities. An alternative to sampling and optimisation is a direct mapping between the data and the posterior distribution, made possible by recent advances in deep learning. Latent Dirichlet Allocation (LDA) is a model which offers an intractable posterior of this nature. In LDA, latent topics are learnt over unlabelled documents in order to soft-cluster the documents. This paper assesses the viability of learning latent topics with an autoencoder (in the form of autoencoding variational Bayes, AEVB) and compares the mimicked posterior distributions to those achieved by VI. After conducting various experiments, the proposed AEVB delivers inadequate performance. Comparable conclusions are achieved only under utopian conditions that are generally unattainable. Further, model specification becomes increasingly complex and deeply circumstantially dependent, which is not in itself a deterrent but does warrant consideration. In a recent study, these concerns were highlighted and discussed theoretically. We confirm the argument empirically by dissecting the autoencoder's iterative process. In investigating the autoencoder, we see performance degrade as models grow in dimensionality. Visualization of the autoencoder reveals a bias towards the initial randomised topics.</p>
      </abstract>
      <kwd-group>
<kwd>Autoencoders</kwd>
        <kwd>Variational Inference</kwd>
        <kwd>Latent Dirichlet Allocation</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        High dimensional data such as text, speech, images and spatiotemporal data are
typically labelled as big data, not only because of their high volume, but also because
of their veracity and velocity. For these reasons, unsupervised representations
that project the data onto a more manageable, lower-dimensional space are
increasingly in demand. Most often, this involves the computation
of a posterior distribution, which comes at a high computational expense. One
such method is topic modelling, which infers latent semantic representations of
text. The high dimensional integrals of the posterior predictive
distribution of a topic model are intractable, and approximation techniques such as
sampling (Markov chain Monte Carlo) or optimization (variational inference)
are standard approaches to approximate these integrals. MCMC samples from
the unnormalised posterior and is guaranteed to converge to the true posterior
given enough data and computational time [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, the
computational costs associated with MCMC make it impractical for large and high
dimensional corpora. On the other hand, variational inference simplifies the
estimation procedure by approximating the posterior with a tractable solution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
but is known for underestimating the posterior variance. Furthermore, for any
new topic model with slightly different assumptions, the inference updates for
both these techniques need to be derived theoretically.
      </p>
      <p>
        An alternative approach to sampling and optimization is to directly map
input data to an approximate posterior distribution. This is called an inference
network and was introduced by Dayan et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in 1995. An autoencoding
variational Bayes (AEVB) algorithm, or variational autoencoder, trains an inference
network [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to perform this mapping, thereby mimicking the effect of
probabilistic inference [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Using Automatic Differentiation Variational Inference
(ADVI) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in combination with AEVB, posterior inference can be performed on
almost any continuous latent variable model.
      </p>
      <p>
        In this paper we describe and investigate the implementation of
autoencoding variational Bayes (AEVB) for LDA. We are specifically interested in the
quality of the posterior distributions it produces. Related work [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has indicated
that a straightforward AEVB implementation does not produce meaningful
topics. The two main challenges stated by the authors are, first, that the Dirichlet
prior is not a location-scale family, which makes the reparameterisation
problematic; and secondly, that because of component collapsing, the inference network
becomes stuck in a bad local optimum in which all the topics are identical.
Although Srivastava &amp; Sutton [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] provided this explanation as well as
a solution to the problem, our aim is to take a step back and analyse the
behaviour of the AEVB on topic models empirically. Our experiments confirm
the issues raised by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and, based on that, we dissect the autoencoder's
iterative process in order to understand how and when the autoencoder allocates
documents to topics.
      </p>
      <p>The structure of the paper is as follows: in Section 2 we provide background
theory on LDA, and in Section 3 we introduce AEVB for LDA before defining
the experiments in Section 4. Section 5 is a dedicated discussion of the
AEVB's performance, which is followed by conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Latent Dirichlet Allocation</title>
      <p>
        LDA is probably the most popular topic model. In LDA, each document is
probabilistically assigned to each topic based on the correlation between the words
in each document. The generative process of LDA is as follows [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: assuming
a corpus consists of K topics, LDA assumes each document is generated by:
1. Randomly choose K topic distributions $\beta_k \sim \mathrm{Dirichlet}(\eta)$ over the available
dictionary, where $\beta$ denotes the topic-word matrix in which $\beta_{i,j}$ is the probability
of the $i$th word belonging to the $j$th topic.
2. For each document $d = \{w_1, w_2, \ldots, w_n\}$:
(a) Randomly choose $\theta_d \sim \mathrm{Dirichlet}(\alpha)$, the distribution over topics (a row of the document-topic
matrix).
(b) For each word $w_n$, randomly select a topic $z_n \sim \mathrm{Multinomial}(\theta_d)$; and
within that topic, sample the word $w_n \sim \mathrm{Multinomial}(\beta_{z_n})$.
A small simulation of this generative process is sketched at the end of this section.
As mentioned before, MCMC can be used to approximate the posterior
distributions. For the scope of this paper, we focus on optimization techniques.
Mean-field variational inference (VI) breaks the coupling between $\theta$ and $z$ by
introducing the free variational parameters $\gamma$ (over $\theta$) and $\phi$ (over $z$). The
variational posterior which best approximates the true posterior when optimized is
then $q(\theta, z \mid \gamma, \phi) = q_{\gamma}(\theta) \prod_n q_{\phi}(z_n)$, and the optimization problem is to maximize
the evidence lower bound (ELBO) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
      </p>
      <p>
        $$\mathcal{L}(\gamma, \phi \mid \alpha, \beta) = \log p(w \mid \alpha, \beta) - D_{KL}\big[q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big] \qquad (2)$$
In the above equation, $D_{KL}$ is the Kullback-Leibler divergence, which is minimized
in order to reduce the distance between the variational and posterior distributions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
For LDA, the ELBO has closed-form updates due to the conjugacy between the
Dirichlet and multinomial distributions [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Deriving these closed-form updates
when even slight deviations in assumptions are required can be cumbersome
and impractical. One example is where the practitioner wants to investigate the
Poisson instead of the multinomial as a count distribution. One can imagine the
far-reaching implications of such a deviation on the coordinate descent equations.
AEVB is a method that shows promise to sidestep this issue.
      </p>
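      <p>To make the generative process above concrete, the NumPy sketch below simulates a small corpus from an LDA model; the topic count, vocabulary size, document lengths and Dirichlet hyperparameters are arbitrary illustrative values rather than the settings used in our experiments.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 5, 1000, 100, 50            # topics, vocabulary size, documents, words per document (illustrative)
eta, alpha = 0.1, 1.0 / K                # symmetric Dirichlet hyperparameters (assumed values)

beta = rng.dirichlet(np.full(V, eta), size=K)     # K x V topic-word matrix
theta = rng.dirichlet(np.full(K, alpha), size=D)  # D x K document-topic matrix

corpus = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])             # step 2(b): a topic for every word slot
    words = [rng.choice(V, p=beta[k]) for k in z]     # a word drawn from that topic's distribution
    corpus.append(words)
      </preformat>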
    </sec>
    <sec id="sec-3">
      <title>AEVB</title>
      <p>
        Before we introduce AEVB, we first need to define autoencoding in general. We
use Figure 2 as a simple illustration. An autoencoder is a particular variant of
neural network; it differs in that the input matrix $X$ is mapped to itself, $\hat{X}$, as
opposed to a response variable $Y$ [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Clearly the response is not of interest,
as, at best, it is a replication of the independent variable. The autoencoder's
purpose is rather to examine the hidden layers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. If the hidden layers,
which represent the data in a lower-dimensional space, are able to replicate the input variables, we
have essentially encoded the same information in a lower-dimensional domain.
Autoencoders are frequently used to generate data, as random numbers can be
fed through this lower-dimensional encoding's weights and biases to generate
approximate output $\hat{X}$ that is similar to the training data. For probabilistic models
with latent variables, as in the case of LDA, the autoencoder is used to infer the variational
parameters of an approximate posterior.
      </p>
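      <p>As a concrete, non-variational illustration of this idea, the sketch below is a one-hidden-layer autoencoder forward pass in NumPy; the dimensions, random (untrained) weights and missing training loop are simplifications of our own, not the architecture used later in the paper.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(3)
n_features, n_code = 50, 5                  # input dimension and bottleneck size (illustrative)

W_enc = rng.normal(scale=0.1, size=(n_features, n_code))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_code, n_features))   # decoder weights

X = rng.normal(size=(10, n_features))       # a small batch of inputs

code = np.tanh(X @ W_enc)                   # lower-dimensional encoding of X
X_hat = code @ W_dec                        # reconstruction of X produced from the code

reconstruction_loss = np.mean((X - X_hat) ** 2)   # the quantity a training loop would minimise
      </preformat>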
      <sec id="sec-3-1">
        <title>What makes AEVB autoencoding?</title>
        <p>
          From a coding perspective, the latent variables $z$ can be interpreted as a code.
The variational posterior can be interpreted as a probabilistic encoder and the
original posterior ($p(\theta, z \mid w, \alpha, \beta)$) as a probabilistic decoder [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The first step in
defining the AEVB is to rewrite the ELBO in Eq. 2 as [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
$$\mathcal{L}(\gamma, \phi \mid \alpha, \beta) = -D_{KL}\big[q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid \alpha)\big] + \mathbb{E}_{q(\theta, z \mid \gamma, \phi)}\big[\log p(w \mid z, \theta, \alpha, \beta)\big].$$
The first term attempts to match the variational posterior over the latent variables to
the prior on the latent variables [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The second term is crucial in the definition
of the AEVB as it ensures that the variational posterior favours values of the
latent variables that are good at explaining the data. This can be thought of as the
reconstruction (or decoder) term in the autoencoder.
        </p>
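        <p>This rewrite follows from the standard ELBO identity; a short sketch of the algebra in the paper's notation (our own derivation, obtained by splitting the joint distribution into its prior and likelihood terms) is:
$$\mathcal{L}(\gamma, \phi \mid \alpha, \beta) = \mathbb{E}_{q(\theta, z \mid \gamma, \phi)}\big[\log p(w, \theta, z \mid \alpha, \beta) - \log q(\theta, z \mid \gamma, \phi)\big]$$
$$= \mathbb{E}_{q(\theta, z \mid \gamma, \phi)}\big[\log p(w \mid z, \theta, \alpha, \beta)\big] + \mathbb{E}_{q(\theta, z \mid \gamma, \phi)}\big[\log p(\theta, z \mid \alpha) - \log q(\theta, z \mid \gamma, \phi)\big]$$
$$= -D_{KL}\big[q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid \alpha)\big] + \mathbb{E}_{q(\theta, z \mid \gamma, \phi)}\big[\log p(w \mid z, \theta, \alpha, \beta)\big].$$</p>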
      </sec>
      <sec id="sec-3-2">
        <title>Stochastic Gradient Descent Estimator</title>
        <p>Stochastic gradient descent (SGD), a scalable variation of regular gradient
descent, is the optimization algorithm used to minimize the KL divergence
(maximize the ELBO). It is stochastic in that it computes an approximate gradient (from a random sample) as
opposed to the true gradient, to speed up computation. After
initializing the parameters of interest, gradient descent optimizes a specified loss
function by iteratively computing the gradient w.r.t. each parameter, multiplying
the gradient by the learning rate, and subtracting the computed quantity from
the parameter, formally:</p>
        <p>Algorithm 1: Stochastic Gradient Descent (SGD)
Input: training data $S$, learning rate $\eta$, initialization $\gamma_0, \phi_0$
Output: model parameters $\Theta = (\gamma, \phi)$
$\gamma \leftarrow \gamma_0$; $\phi \leftarrow \phi_0$;
repeat
    for $(x, y) \in S$ do
        $\Theta \leftarrow \Theta - \eta \, \partial_{\Theta} \mathcal{L}(\gamma, \phi \mid \alpha, \beta)$;
    end
until convergence;</p>
        <p>The learning rate is normally adjusted dynamically to further improve efficiency.
The true gradient can also be smoothed by adding a regularization term to
improve ease of computation.</p>
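        <p>As an illustration of Algorithm 1, the loop below applies the same update rule to a toy least-squares objective in NumPy; the data, loss function and learning rate are placeholders of our own and not the ELBO optimized by the AEVB.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                    # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)                              # parameter initialization
lr = 0.01                                        # learning rate (illustrative)

for epoch in range(100):
    for i in rng.permutation(len(y)):            # one (x, y) pair at a time, as in Algorithm 1
        grad = 2 * (X[i] @ theta - y[i]) * X[i]  # gradient of this sample's squared error
        theta -= lr * grad                       # subtract the learning rate times the gradient from the parameters
        </preformat>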
      </sec>
      <sec id="sec-3-3">
        <title>AEVB for LDA</title>
        <p>Autoencoding variational Bayes (AEVB) builds on ideas from variational
inference (VI) to offer a potentially scalable alternative. VI works by maximizing
the ELBO (Eq. 2), where $q(\theta, z \mid \gamma, \phi)$ can be thought of as the latent `code' that
describes a fixed $x$; it should thus map the input $x$ to the lower-dimensional latent
space. $q(\theta, z \mid \gamma, \phi)$ is the encoder.</p>
        <p>Optimizing the objective function tries to map the input variable $x$ to a
specified latent space and then back, to replicate the input. This structure is
characteristic of an autoencoder, which is where the name AEVB comes from. ADVI is
used to efficiently maximize the ELBO, equivalent to minimizing the original KL
divergence. ADVI utilizes stochastic gradient descent, but derivatives are not
computed numerically (in the traditional sense), nor symbolically; it instead
relies on a representation of variables known as dual numbers to efficiently
compute gradients (the details of which are superfluous to this discussion).</p>
        <p>But how do we parameterize $q(\theta, z \mid \gamma, \phi)$ and $p(\theta, z \mid w, \alpha, \beta)$? We ought to
choose $q(\theta, z \mid \gamma, \phi)$ such that it can approximate the true posterior $p(\theta, z \mid w, \alpha, \beta)$,
and the decoder such that it is flexible enough to represent a vast variety of
distributions. Parameterizing these functions with neural networks allows for
great flexibility, and they can be efficiently optimized over large datasets.</p>
        <p>$q(\theta, z \mid \gamma, \phi)$, the encoder, is parameterized such that the code's
dimensionality corresponds to a mean and variance for each topic. The parameter space of
the decoder is specified as the reverse of the encoder's. In the case of LDA, the
weights and biases of the encoder are specified as:
$$W_0 : w \times h, \qquad b_0 : h \times 1, \qquad W_1 : h \times 2(k-1), \qquad b_1 : 2(k-1) \times 1,$$
where $w$ is the number of words, $h$ the number of hidden units and $k$ the number
of topics. A uniform Dirichlet prior with concentration parameter set by $k$ (the number of topics) is specified.
The objective (ELBO) is then maximized by ADVI and the model parameters are
learnt.</p>
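        <p>A minimal sketch of this encoder parameterisation in NumPy is given below; the tanh activation, the random (untrained) weights and the example sizes are assumptions made purely for illustration.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(2)
w, h, k = 1000, 100, 10                      # vocabulary size, hidden units, topics (illustrative)

W0, b0 = rng.normal(scale=0.01, size=(w, h)), np.zeros(h)                        # w x h and h x 1
W1, b1 = rng.normal(scale=0.01, size=(h, 2 * (k - 1))), np.zeros(2 * (k - 1))    # h x 2(k-1) and 2(k-1) x 1

def encode(x):
    """Map a bag-of-words count vector x (length w) to the mean and log-variance of the latent code."""
    hidden = np.tanh(x @ W0 + b0)
    out = hidden @ W1 + b1
    mu, log_sigma = out[: k - 1], out[k - 1 :]
    return mu, log_sigma

mu, log_sigma = encode(rng.poisson(0.05, size=w))   # a fake document's word counts
        </preformat>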
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>A number of experiments were conducted to answer the following research
questions:
1. Does the AEVB LDA provide comparable results, and predictive capability,
to the well-established VI-learnt LDA?
2. Does the autoencoder offer significant processing-efficiency advantages as
datasets scale?
3. What drawbacks does the autoencoder suffer from?</p>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>
          The experiments were conducted on the 20 Newsgroups [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] dataset, as its known
structure serves as an aid in diagnosing performance. The dataset consists of 18,000
documents covering a diverse variety of topics including `medical',
`space', `religion', `computer science' and many more. The corpus was vectorized
to a bag-of-words representation after some standard pre-processing, including
removing stopwords, lemmatizing, tokenization, and pruning uncommon or overly
common words. Finally, the vocabulary was reduced to contain only the 1000
most frequent words.
        </p>
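        <p>A rough equivalent of this preprocessing pipeline, assuming scikit-learn's bundled copy of the corpus, its built-in English stop-word list and illustrative pruning thresholds (lemmatisation omitted for brevity), is:</p>
        <preformat>
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# fetch the corpus with headers, footers and quoted replies stripped
docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data

# bag-of-words: drop stop words, prune very rare and very common tokens,
# and keep only the 1000 most frequent remaining words
vectorizer = CountVectorizer(stop_words='english', min_df=5, max_df=0.5, max_features=1000)
counts = vectorizer.fit_transform(docs)     # sparse D x 1000 document-term matrix
        </preformat>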
      </sec>
      <sec id="sec-4-2">
        <title>Model architecture</title>
        <p>An LDA model with $w$ tokens, $D$ documents and $K$ topics requires learning
two matrices: $\theta$ ($D \times K$), describing the topic distribution over documents, and $\beta$ ($K \times V$),
portraying the word distribution over topics. To learn the topic distribution over
documents with an autoencoder, we specify the lowest-dimensional hidden layer
to correspond with the dimensionality of a simplex over the $K$ topics, so as to soft-cluster
the $D$ documents over the $K$ topics, that is, of dimension $K - 1$. However, since
we want to learn distributions over the topics, leveraging a standard Gaussian,
we specify dimensionality $2(K - 1)$ to represent a mean and variance for
each topic. The model is constructed with a hidden layer of 100 units.
A uniform Dirichlet prior is specified.</p>
        <p>All experiments were conducted on a 2012 MacBook Air with a 1.8 GHz Intel
Core i5 processor and 4GB of memory. All relevant code is well documented and
available here: https://www.zachwolpe.com/research.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Evaluation metrics</title>
        <p>
          Perplexity is an intrinsic evaluation metric for topic models and an indication of
`how surprised' the model is to see a new document. Recent studies have shown
that perplexity scores and human judgement of topics are often not correlated,
i.e. good perplexity scores might not yield human-interpretable topics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. An
alternative metric is topic coherence, which attempts to quantify how logical
(coherent) topics are by measuring the conditional probability of words given a
topic. All flavours of topic coherence follow the general equation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]:
$$\mathrm{Coherence} = \sum_{i &lt; j} \mathrm{score}(w_i, w_j)$$
        </p>
        <p>
          where the words $W = w_1, w_2, \ldots, w_n$ are ordered from most to least frequently
appearing in the topic. The two leading coherence algorithms (UMass and UCI)
essentially measure the same thing [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and we have decided on UMass. The
UMass scores between $\{w_i, w_j\}$ combinations (which are summed subsequent to
calculation) are computed as:
$$\mathrm{score}_{\mathrm{UMass}}(w_i, w_j \mid K_k) = \log \frac{D(w_i, w_j) + \epsilon}{D(w_i)}$$
        </p>
        <p>
          where $K_k$ is the $k$th topic returned by the model and $w_i$ is more common than
$w_j$ ($i &lt; j$). $D(w_i)$ is the probability that word $w_i$ is in a document (the number
of documents in which $w_i$ appears divided by the total number of documents), and $D(w_i, w_j)$ is
the corresponding probability that $w_i$ and $w_j$ appear in a document together, so their ratio
captures how likely $w_j$ is to occur in a document given that $w_i$ is in the
document, which alludes to a dependency between key words within
a topic [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. $\epsilon$ simply provides a smoothing parameter.
        </p>
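        <p>The sketch below computes this UMass score from a binary document-term matrix for the top words of a single topic; using raw document counts and a smoothing value of 1 are common conventions that we assume here for illustration.</p>
        <preformat>
import numpy as np

def umass_coherence(doc_word, top_words, eps=1.0):
    """doc_word: binary D x V matrix; top_words: word indices ordered from most to least common in the topic."""
    score = 0.0
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):   # pairs where w_i is more common than w_j
            wi, wj = top_words[i], top_words[j]
            d_wi = np.sum(doc_word[:, wi])                       # documents containing w_i
            d_wi_wj = np.sum(doc_word[:, wi] * doc_word[:, wj])  # documents containing both words
            score += np.log((d_wi_wj + eps) / d_wi)
    return score
        </preformat>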
      </sec>
      <sec id="sec-4-4">
        <title>Results</title>
        <p>Topic model coherence was used as the primary metric in assessing model
performance. To better generalize the findings, we computed coherence for a variety
of topic counts K, measuring the performance as models grow in complexity. Further,
to account for the sampling distribution, coherence was repeatedly computed (10
times) for the same model with different random samples. It is apparent from
Figure 3 that although the autoencoder matches VI's performance under
simple textbook conditions (K = 5 topics), as models grow in complexity and
dimensionality the autoencoder's coherence scores steadily decline - note that the
labels reading pymc3 and sklearn correspond to the AEVB and VI
implementations respectively. Although LDA is an unsupervised model, the 20 Newsgroups
dataset is labelled, so we have the advantage of knowing the true structure of
the dataset to be K = 20 - which coincides with the best performance using
VI. Tables 1 and 2 provide topic examples from both algorithms respectively. The
repetitive top words in topics 1 and 2 of Table 2 confirm the AEVB's inability
to produce meaningful topics.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion on the autoencoder's performance</title>
      <p>A callback function was written to assess the autoencoder's iterative process.
The callback samples the $\theta$ ($D \times K$) distribution per epoch for the purpose of
understanding how the autoencoder allocates documents to topics. Since $\theta$ is learnt
as a posterior distribution, we need to sample from $\theta$ to assess its current state.
This was performed under two variations:
1. one sample per epoch: to assess randomness;
2. 100,000 samples per epoch: to assess the current state of $\theta$.</p>
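      <p>The callback is sketched below in the style of a PyMC3 ADVI fit; the latent variable name doc_topic, the snapshot interval and the number of draws are illustrative assumptions rather than the exact code used in our experiments.</p>
      <preformat>
import pymc3 as pm

theta_history = []   # snapshots of the document-topic distribution over training

def track_theta(approx, losses, i):
    # pm.fit invokes callbacks with the current approximation at every iteration;
    # every 100th iteration is treated as an 'epoch' here (an assumption)
    if i % 100 == 0:
        trace = approx.sample(1000)                          # many draws to assess the current state of theta
        theta_history.append(trace['doc_topic'].mean(axis=0))

# hypothetical usage, assuming an LDA model whose document-topic variable is named 'doc_topic':
# with lda_model:
#     approx = pm.fit(n=20000, method='advi', callbacks=[track_theta])
      </preformat>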
      <p>This was conducted for the first 100 epochs to assess the initial conditions of
the autoencoder (with a constant K = 10 latent topics). Figure 4 depicts the
topic allocation for a random sample of documents, coloured by probability
density. One would expect the distribution over topics to begin uniformly spread
and thereafter narrow to a single topic. However, we observe that the vast majority
of documents are, even initially, assigned with bias to the first few topics, with an
increasing density over the first 100 epochs cementing the allocation, despite a
uniform Dirichlet prior.</p>
      <p>The single sample depicted in Figure 5 would ideally be
completely randomized under initial conditions; however, it is apparent that it exhibits some
bias. Numerically, we can assess the distribution of topic allocations in Figure 4.
One would expect a far more uniformly spread distribution over the K
topics, yet the autoencoder exhibits bias as early as 5 epochs into running the
model, a bias which is then exaggerated.</p>
      <p>Fig. 5. A single sample per epoch.</p>
      <p>AEVB for LDA boasts the theoretical advantage of brisk computation;
however, this allure evades comparisons of quality. In conducting this analysis our
contribution is twofold:
- Perform a thorough comparison between AEVB and VI posterior inference
for LDA, uncovering the benefits and drawbacks of the AEVB as an
alternative, speedy approach.
- Uncover the AEVB's shortcomings in an attempt to deduce the bias that limits its
performance, addressing why and where it fails.</p>
      <p>After detailed experimentation it is readily apparent that the autoencoder falls
short when juxtaposed against established, well-engineered techniques. When
analyzing topic coherence performance, the autoencoder offers results analogous
to the VI LDA for simple models; however, as models grow in complexity
it is unambiguously inferior to established methods when accounting for
sampling variability. Results are only comparable under textbook conditions. We
show that the encoder fails to adequately explore the domain space and is
heavily biased by its initial random conditions. Lower predictive accuracy was also found
in the PyMC3 tutorial mentioned earlier
(https://docs.pymc.io/notebooks/ldaadvi-aevb.html). More specifically, in that study the log-likelihood of
topics on held-out words was used as a goodness-of-fit test.</p>
      <p>
        Future work includes an implementation of the solutions offered in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to the
AEVB's poor performance, namely collapsing the latent variable z and applying a
Laplace approximation to the Dirichlet prior. Furthermore, we want to capitalize
on the main proposition of ADVI approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: that the derivation of closed-form
updates is not needed given the variational posterior. More specifically,
we plan to apply AEVB to short-text topic models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. 20 newsgroups dataset, http://people.csail.mit.edu/jrennie/20Newsgroups/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>Journal of machine Learning research 3(Jan)</source>
          ,
          <volume>993</volume>
          –
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerrish</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd-Graber</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Reading tea leaves: How humans interpret topic models</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>288</volume>
          –
          <issue>296</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dayan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neal</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.S.:</given-names>
          </string-name>
          <article-title>The helmholtz machine</article-title>
          .
          <source>Neural computation 7(5)</source>
          ,
          <volume>889</volume>
          –
          <fpage>904</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          ), http://www.deeplearningbook.org
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Auto-encoding variational Bayes</article-title>
          .
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kucukelbir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranganath</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Automatic differentiation variational inference</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>430</fpage>
          –
          <lpage>474</lpage>
          (Jan
          <year>2017</year>
          ), http://dl.acm.org/citation.cfm?id=3122009.3122023
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kucukelbir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranganath</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Automatic di erentiation variational inference</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <volume>430</volume>
          –
          <fpage>474</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <source>Information Theory and Statistics</source>
          . Wiley, New York (
          <year>1959</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mazarura</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Waal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comparison of the performance of latent dirichlet allocation and the dirichlet multinomial mixture model on short text</article-title>
          .
          <source>In: 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech)</source>
          . pp.
          <volume>1</volume>
          –
          <issue>6</issue>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mimno</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talley</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leenders</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Optimizing semantic coherence in topic models</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          . pp.
          <volume>262</volume>
          –
          <fpage>272</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.P.:</given-names>
          </string-name>
          <article-title>Machine learning : a probabilistic perspective</article-title>
          . MIT Press, Cambridge, Mass. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grieser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Automatic evaluation of topic coherence</article-title>
          . In: Human Language Technologies:
          <article-title>The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          . pp.
          <volume>100</volume>
          –
          <fpage>108</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rezende</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Stochastic backpropagation and approximate inference in deep generative models</article-title>
          .
          <source>arXiv preprint arXiv:1401.4082</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Autoencoding Variational Inference For Topic Models</article-title>
          . arXiv:1703.01488 [stat] (
          <year>Mar 2017</year>
          ), http://arxiv.org/abs/1703.01488
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>