=Paper=
{{Paper
|id=Vol-2540/article_3
|storemode=property
|title=Autoencoding variational Bayes for latent Dirichlet allocation
|pdfUrl=https://ceur-ws.org/Vol-2540/FAIR2019_paper_33.pdf
|volume=Vol-2540
|authors=Zach Wolpe,Alta De Waal
|dblpUrl=https://dblp.org/rec/conf/fair2/WolpeW19
}}
==Autoencoding variational Bayes for latent Dirichlet allocation==
Autoencoding variational Bayes for latent Dirichlet allocation

Zach Wolpe¹ and Alta de Waal¹²
¹ Department of Statistics, University of Pretoria
² Center for Artificial Intelligence Research (CAIR)

Abstract. Many posterior distributions take intractable forms and thus require approximate inference where analytical solutions cannot be found. Variational inference (VI) and Markov chain Monte Carlo (MCMC) are established mechanisms for approximating these intractable quantities. An alternative to sampling and optimisation is a direct mapping between the data and the posterior distribution, made possible by recent advances in deep learning methods. Latent Dirichlet Allocation (LDA) is a model with an intractable posterior of this nature. In LDA, latent topics are learnt over unlabelled documents in order to soft cluster the documents. This paper assesses the viability of learning latent topics with an autoencoder (in the form of autoencoding variational Bayes, AEVB) and compares the mimicked posterior distributions to those obtained by VI. Across our experiments the proposed AEVB delivers inadequate performance: comparable conclusions are only reached under utopian conditions that are generally unattainable. Further, model specification becomes increasingly complex and heavily dependent on circumstance, which is not in itself a deterrent but does warrant consideration. A recent study highlighted and discussed these concerns theoretically; we confirm the argument empirically by dissecting the autoencoder's iterative process. In investigating the autoencoder, we see performance degrade as models grow in dimensionality, and visualisation of the autoencoder reveals a bias towards the initial randomised topics.

Keywords: Autoencoders · Variational Inference · Latent Dirichlet Allocation · Natural Language Processing · Deep Learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

High dimensional data such as text, speech, images and spatiotemporal data are typically labelled as big data, not only because of high volume, but also because of veracity and velocity. It is for these reasons that unsupervised representations are increasingly in demand: they project the data onto a lower dimensional space that is more manageable. Most often this involves the computation of a posterior distribution, which comes at high computational expense. One such method is topic modelling, which infers latent semantic representations of text. The high dimensional integrals of the posterior distribution of a topic model are intractable, and approximation techniques such as sampling (Markov chain Monte Carlo) or optimisation (variational inference) are the standard approaches to approximating them. MCMC samples from a distribution proportional to the posterior and is guaranteed to converge to the true posterior given enough data and computational time [12]. However, the computational costs associated with MCMC make it impractical for large and high dimensional corpora. Variational inference, on the other hand, simplifies the estimation procedure by approximating the posterior with a tractable surrogate [2], but is known to underestimate the posterior variance.
Furthermore, for any new topic model with slightly different assumptions, the inference updates for both these techniques need to be derived theoretically. An alternative approach to sampling and optimisation is to directly map input data to an approximate posterior distribution. This is called an inference network and was introduced by Dayan et al. [4] in 1995. An autoencoding variational Bayes (AEVB) algorithm, or variational autoencoder, trains an inference network [14] to perform this mapping and thereby mimics the effect of probabilistic inference [15]. Using Automatic Differentiation Variational Inference (ADVI) [7] in combination with AEVB, posterior inference can be performed on almost any continuous latent variable model.

In this paper we describe and investigate the implementation of autoencoding variational Bayes (AEVB) for LDA. We are specifically interested in the quality of the posterior distributions it produces. Related work [15] has indicated that a straightforward AEVB implementation does not produce meaningful topics. The two main challenges stated by the authors are, firstly, that the Dirichlet prior is not a location-scale family, which makes the reparameterisation problematic; and secondly, that because of component collapsing the inference network becomes stuck in a bad local optimum in which all the topics are identical. Although Srivastava & Sutton [15] provided this explanation as well as a solution to the problem, our aim is to take a step back and analyse the behaviour of the AEVB on topic models empirically. Our experiments confirm the issues raised by [15] and, based on that, we dissect the autoencoder's iterative process in order to understand how and when the autoencoder allocates documents to topics.

The structure of the paper is as follows: in Section 2 we provide background theory on LDA, and in Section 3 we introduce AEVB for LDA, before defining the experiments in Section 4. Section 5 is a dedicated discussion of the AEVB's performance, followed by conclusions in Section 6.

2 Latent Dirichlet Allocation

LDA is probably the most popular topic model. In LDA, each document is probabilistically assigned to each topic based on the correlation between the words in each document. The generative process of LDA is as follows [2]. Assuming a corpus consists of K topics, LDA assumes each document is generated by:

1. Randomly choose K topic distributions β_k ∼ Dirichlet(λ_β) over the available dictionary, where β denotes the topic × word matrix in which β_{i,j} is the probability of the i-th word under the j-th topic.
2. For each document d = {w_1, w_2, ..., w_n}:
   (a) Randomly choose θ_d, the distribution over topics (a row of the document × topic matrix).
   (b) For each word, randomly select a topic z_n ∼ Multinomial(θ_d); and within that topic, sample a word w_n ∼ Multinomial(β_{z_n}).

Figure 1 illustrates the generative model graphically, with plates representing iterations over documents 1, ..., M and words 1, ..., N. The shaded node w is the only observable variable in the model. α is simply a model hyperparameter.

Fig. 1. LDA graphical model.

Under this generative model, the marginal likelihood of a document w is [15]:

p(w | α, β) = ∫_θ ( ∏_{n=1}^{N} Σ_{z_n=1}^{k} p(w_n | z_n, β) p(z_n | θ) ) p(θ | α) dθ.   (1)

Due to the coupling of θ and β under the multinomial assumption, posterior inference over the hidden variables θ and z is intractable.
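To make the generative process above concrete, the following is a minimal NumPy sketch of sampling a toy corpus from LDA. It is illustrative only: the function name, the symmetric hyperparameter values and the fixed document length N are assumptions, not taken from the paper.

```python
import numpy as np

def generate_corpus(K=5, D=100, V=1000, N=50, alpha=0.1, lam=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, lam), size=K)        # K x V topic-word matrix
    docs, thetas = [], []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))         # topic proportions for this document
        z = rng.choice(K, size=N, p=theta)               # topic assignment for each word slot
        words = np.array([rng.choice(V, p=beta[k]) for k in z])  # word drawn from its topic
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), beta
```

Inference reverses this process: given only the observed words, it recovers plausible values for θ and β, which is exactly the intractable posterior computation discussed next.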
2.1 Mean field variational inference

As mentioned before, MCMC can be used to approximate the posterior distributions. For the scope of this paper, we focus on optimisation techniques. Mean field variational inference (VI) breaks the coupling between θ and z by introducing the free variational parameters γ (over θ) and φ (over z). The variational posterior, which best approximates the true posterior when optimised, is q(θ, z | γ, φ) = q_γ(θ) ∏_n q_φ(z_n), and the optimisation problem is to maximise the evidence lower bound (ELBO) [6]:

L(γ, φ | α, β) = log p(w | α, β) − D_KL[ q(θ, z | γ, φ) || p(θ, z | w, α, β) ].   (2)

In the above equation, D_KL is the Kullback–Leibler divergence, which is minimised in order to reduce the distance between the variational and the true posterior distribution [9]. For LDA, the ELBO has closed form updates due to the conjugacy between the Dirichlet and multinomial distributions [15]. Deriving these closed form updates when even slight deviations in assumptions are needed can be cumbersome and impractical. One example is where the practitioner wants to investigate the Poisson instead of the multinomial as a count distribution. One can imagine the far-reaching implications of such a deviation on the coordinate descent equations. AEVB is a method that shows promise to sidestep this issue.

3 AEVB

Before we introduce AEVB, we first need to define autoencoding in general. We use Figure 2 as a simple illustration. An autoencoder is a particular variant of neural network, different in that the input matrix X is mapped to itself, X̂, as opposed to a response variable Y [12]. Clearly the response is not of interest, as at best it is a replication of the independent variable. The autoencoder's purpose is rather to examine the hidden layers [5]. If the hidden layers, which represent a lower dimensional space, are able to replicate the input variables, we have essentially encoded the same information in a lower dimensional domain. Autoencoders are frequently used to generate data, as random numbers can be fed through this lower dimensional encoding's weights and biases to generate approximate output X̂ that is similar to the training data. For probabilistic models with latent variables, as in the case of LDA, an autoencoder can be used to infer the variational parameters of an approximate posterior.

Fig. 2. Autoencoder illustration.

3.1 What makes AEVB autoencoding?

From a coding perspective, the latent variables z can be interpreted as a code. The variational posterior can be interpreted as a probabilistic encoder and the original posterior (p(θ, z | w, α, β)) as a probabilistic decoder [6]. The first step in defining the AEVB is to rewrite the ELBO in Eq. 2 as [6]:

L(γ, φ | α, β) = −D_KL[ q(θ, z | γ, φ) || p(θ, z | α) ] + E_{q(θ,z|γ,φ)}[ log p(w | z, θ, α, β) ].

The first term attempts to match the variational posterior over latent variables to the prior on the latent variables [15]. The second term is crucial in the definition of the AEVB, as it ensures that the variational posterior favours values of the latent variables that are good at explaining the data. This can be thought of as the reconstruction (or decoder) term of the autoencoder.
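The two-term structure of this objective can be sketched as a naive Monte Carlo estimator. Everything below is a hypothetical interface (the callables and their names are not from the paper); it simply mirrors the decomposition above: an expected reconstruction term minus a KL term pulling q towards the prior.

```python
import numpy as np

def elbo_estimate(log_likelihood, log_prior, q_sample, q_logpdf, n_samples=64):
    """Naive Monte Carlo estimate of the ELBO as written in Section 3.1.

    log_likelihood(z): log p(w | z, theta, alpha, beta) for a latent sample z
    log_prior(z):      log density of the prior on the latent variables at z
    q_sample():        draw one sample from the variational posterior q
    q_logpdf(z):       log density of q at z
    """
    samples = [q_sample() for _ in range(n_samples)]
    recon = np.mean([log_likelihood(z) for z in samples])        # E_q[log p(w | z, ...)]
    kl = np.mean([q_logpdf(z) - log_prior(z) for z in samples])  # approx. D_KL[q || prior]
    return recon - kl                                            # ELBO estimate
```

In a working AEVB this estimate (or an analytic KL, where available) is what the optimiser maximises with respect to the encoder parameters.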
3.2 Stochastic gradient descent estimator

Stochastic gradient descent (SGD), a scalable variant of regular gradient descent, is the optimisation algorithm used to minimise the KL divergence (maximise the ELBO). It is stochastic in that it computes an approximate gradient from a random sample, as opposed to the true gradient, to speed up computation. After initialising the parameters of interest, gradient descent optimises a specified loss function by iteratively computing the gradient with respect to each parameter, multiplying the gradient by the learning rate and subtracting the computed quantity from the parameter. Formally:

Algorithm 1: Stochastic Gradient Descent (SGD)
Input: training data S, learning rate η, initialisation σ
Output: model parameters Θ = (γ, φ)
  γ ← 0; φ ← 0
  repeat
    for (x, y) ∈ S do
      Θ ← Θ − η ∂L(γ, φ | α, β)/∂Θ
    end
  until convergence

The learning rate is normally adjusted dynamically to improve efficiency further. The true gradient can be smoothed by adding a regularisation term to ease computation.

3.3 AEVB for LDA

Autoencoding variational Bayes (AEVB) builds on ideas from variational inference (VI) to offer a potentially scalable alternative. VI works by maximising the ELBO (Eq. 2), where q(θ, z | γ, φ) can be thought of as the latent 'code' that describes a fixed input x; it should therefore map the input x to the lower dimensional latent space. In this sense q(θ, z | γ, φ) is the encoder. Optimising the objective function attempts to map the input variable x to a specified latent space and then back to a replication of the input. This structure is indicative of an autoencoder, which is where the name AEVB comes from.

ADVI is used to efficiently maximise the ELBO, which is equivalent to minimising the original KL divergence. ADVI utilises stochastic gradient descent, but derivatives are computed neither numerically (in the traditional sense) nor symbolically; instead it relies on a representation of variables known as dual numbers to efficiently compute gradients, the details of which are beyond the scope of this discussion.

But how do we parameterise q(θ, z | γ, φ) and p(θ, z | w, α, β)? We ought to choose q(θ, z | γ, φ) such that it can approximate the true posterior p(θ, z | w, α, β), and p(θ, z | w, α, β) such that it is flexible enough to represent a wide variety of distributions. Parameterising these functions with neural networks allows for great flexibility, and they can be optimised efficiently over large datasets.

q(θ, z | γ, φ), the encoder, is parameterised such that the code's dimensionality corresponds to a mean and variance for each topic. The parameter space of the decoder is specified as the mirror image of the encoder. In the case of LDA, the weights and biases of the encoder have dimensions

W_0: w × h,   b_0: h × 1,   W_1: h × 2(k − 1),   b_1: 2(k − 1) × 1,

where w is the number of words, h the dimensionality of the hidden layer and k the number of topics. A uniform Dirichlet prior with α = k (the number of topics) is specified. The objective (ELBO) is then maximised by ADVI and the model parameters are learnt.

4 Experiments

A number of experiments were conducted to answer the following research questions:

1. Does the AEVB-learnt LDA provide comparable results, and predictive capability, to the well-established VI-learnt LDA?
2. Does the autoencoder offer significant processing-efficiency advantages as datasets scale?
3. What drawbacks does the autoencoder suffer?

4.1 Dataset

The experiments were conducted on the 20 Newsgroups [1] dataset, as its known structure serves as an aid in diagnosing performance.
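For reference, a corpus of this kind can be fetched and vectorised along the following lines. This is a minimal sketch assuming scikit-learn, which the paper does not name as its tooling; the pruning thresholds are illustrative, and the lemmatisation step mentioned in the pre-processing described below would require an additional library (e.g. spaCy or NLTK) and is omitted here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Fetch the 20 Newsgroups corpus (headers/footers/quotes stripped to reduce noise).
newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))

# Bag-of-words vectorisation: drop English stopwords, prune overly rare or overly
# common words (thresholds here are assumptions), and keep the 1000 most frequent terms.
vectorizer = CountVectorizer(stop_words="english", min_df=5, max_df=0.5, max_features=1000)
doc_term_matrix = vectorizer.fit_transform(newsgroups.data)
print(doc_term_matrix.shape)   # (number of documents, 1000)
```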
The dataset consists of 18,000 documents covering a diverse range of topics, including 'medical', 'space', 'religion', 'computer science' and many more. The corpus was vectorised to a bag-of-words representation after standard pre-processing: removing stopwords, lemmatising, tokenising, and pruning uncommon or overly common words. Finally, the vocabulary was reduced to the 1000 most frequent words.

4.2 Model architecture

An LDA model with w tokens, D documents and K topics requires learning two matrices: θ (of dimension D × K), describing the topic distribution over documents, and β (of dimension K × V), portraying the word distribution over topics. To learn the topic distribution over documents with an autoencoder, we specify the lowest dimensional hidden layer to correspond to the dimensionality of a K-dimensional simplex, which soft clusters the D documents over the K topics; that is, of dimension K − 1. However, since we want to learn distributions over the topics, leveraging a standard Gaussian, we specify dimensionality 2 × (K − 1) to represent a mean µ and variance σ for each topic. The model is constructed with 100 hidden layers of these dimensions. A uniform Dirichlet prior is specified.

All experiments were conducted on a 2012 MacBook Air with a 1.8 GHz Intel Core i5 processor and 4 GB of memory. All relevant code is well documented and available here: https://www.zachwolpe.com/research.

4.3 Evaluation metrics

Perplexity is an intrinsic evaluation metric for topic models and an indication of 'how surprised' the model is to see a new document. Recent studies have shown that perplexity scores and human judgement of topics are often not correlated, i.e. good perplexity scores might not correspond to human-interpretable topics [3]. An alternative metric is topic coherence, which attempts to quantify how logical (coherent) topics are by measuring the conditional probability of words given a topic. All flavours of topic coherence follow the general equation [11]:

Coherence = Σ_{i<j} score(w_i, w_j)
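As an illustration of how such a pairwise coherence score can be computed in practice, the following sketch uses gensim's CoherenceModel. The paper does not state which implementation or which coherence flavour it uses, so the measure chosen here ("c_v") and the helper's name are assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def topic_coherence(top_words_per_topic, tokenised_docs, measure="c_v"):
    """Score topics (each given as a list of top words) against a tokenised corpus.

    top_words_per_topic: e.g. [["space", "nasa", "orbit"], ["god", "church", "faith"], ...]
    tokenised_docs:      list of token lists, one per document
    """
    dictionary = Dictionary(tokenised_docs)
    cm = CoherenceModel(topics=top_words_per_topic,
                        texts=tokenised_docs,
                        dictionary=dictionary,
                        coherence=measure)
    return cm.get_coherence()
```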