=Paper=
{{Paper
|id=Vol-1308/paper5
|storemode=property
|title=Faster PLSA
|pdfUrl=https://ceur-ws.org/Vol-1308/paper5.pdf
|volume=Vol-1308
|dblpUrl=https://dblp.org/rec/conf/syrcodis/AvanesovK14
}}
==Faster PLSA==
Valeriy Avanesov, Ilya Kozlov
The Institute for System Programming (ISP) of the Russian Academy of Sciences
kozlov-ilya@ispras.ru, avanesov@ispras.ru
Proceedings of the Tenth Spring Researcher's Colloquium on Databases and Information Systems, Veliky Novgorod, Russia, 2014

Abstract

Probabilistic Latent Semantic Analysis (PLSA) is an effective technique for information retrieval, but it has a serious drawback: it consumes a huge amount of computational resources, so it is hard to train this model on a large collection of documents. The aim of this paper is to improve the time efficiency of the training algorithm. Two different approaches are explored: one is based on efficiently finding an appropriate initial approximation; the idea of the other is that, for most of the collection, topics may be extracted from a relatively small fraction of the data.

1 Introduction

Topic modeling is an application of machine learning to text analysis. It is useful for many text analysis tasks, for example document categorization [1], spam detection [2], phishing detection [3] and many other applications. One of the most widespread algorithms is Probabilistic Latent Semantic Analysis (PLSA), suggested by Thomas Hofmann in [4].

1.1 Generative model

PLSA is based on the "bag of words" generative model: every document is assumed to be a multinomial distribution over topics, and every topic is a multinomial distribution over words. The generative model may be defined as follows:

• For every position in document d, choose a topic t i.i.d. from the document's distribution over topics.

• Choose a word w from topic t.

The aim of topic modeling is to recover the topics and the distribution of each document over topics.
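A minimal Python sketch of this generative process, assuming the distributions theta_d = p(t|d) and phi = p(w|t) are given; the function name and dense array layout are illustrative.

    import numpy as np

    def generate_document(theta_d, phi, n_words, seed=0):
        """Sample a bag of words for one document under the PLSA generative model.

        theta_d : (T,) array, the document's distribution over topics p(t|d)
        phi     : (W, T) array, column t is that topic's distribution over words p(w|t)
        """
        rng = np.random.default_rng(seed)
        words = []
        for _ in range(n_words):
            t = rng.choice(phi.shape[1], p=theta_d)    # choose a topic for this position
            w = rng.choice(phi.shape[0], p=phi[:, t])  # choose a word from that topic
            words.append(w)
        return words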
1.2 Topic modeling as an optimization problem

According to the generative model, one can estimate the probability of observing the collection D as:

    p(D) = \prod_{d \in D} \prod_{w \in d} \sum_{t} p(t|d) \, p(w|t)    (1)

Denote ϕ_wt = p(w|t) and θ_td = p(t|d). One may obtain ϕ_wt and θ_td as the solution of the optimization problem

    L = \sum_{d \in D} \sum_{w \in d} \log \sum_{t} \phi_{wt} \theta_{td} \to \max    (2)

with the constraints

    \forall t \; \sum_{w} \phi_{wt} = 1, \qquad \forall d \; \sum_{t} \theta_{td} = 1    (3)

and

    \forall t, w \; \phi_{wt} \ge 0, \qquad \forall d, t \; \theta_{td} \ge 0    (4)

1.3 Topic modeling as matrix decomposition

1.3.1 Kullback-Leibler divergence

The Kullback-Leibler divergence is a non-negative measure of the difference between two probability distributions:

    KL(p \| q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}    (5)

Consider an empirical distribution p̂_i and some parametric distribution q_i = q_i(α) which is used to explain p̂_i. It is easy to see that in this case minimization of the KL divergence is equivalent to estimation of α by maximum likelihood:

    KL(\hat{p} \| q(\alpha)) = \sum_{i=1}^{n} \hat{p}_i \ln \frac{\hat{p}_i}{q_i(\alpha)} \to \min_{\alpha}    (6)

    \Leftrightarrow \sum_{i=1}^{n} \hat{p}_i \ln q_i(\alpha) \to \max_{\alpha}

Thus one can easily see that (2) is equivalent to weighted Kullback-Leibler divergence minimization:

    \sum_{d \in D} n_d \, KL_w\!\left( \frac{n_{dw}}{n_d} \,\Big\|\, \sum_{t \in T} \phi_{wt} \theta_{td} \right) \to \min_{\Phi, \Theta}    (7)

where n_dw is the number of occurrences of word w in document d and n_d is the number of words in document d.

1.3.2 Matrix decomposition

Denote the empirical distribution of words in document d as p̂(w, d) = n_dw / n_d. With this notation one can consider the problem (2) as a matrix decomposition:

    F \approx_{KL} \Phi \Theta    (8)

where the matrix F = (p̂(w, d))_{W×D} is the empirical distribution of words over documents, the matrix Φ = (ϕ_wt)_{W×T} is the distribution of words over topics, and the matrix Θ = (θ_td)_{T×D} is the distribution of documents over topics. Thus our optimization problem may be rewritten in Kullback-Leibler notation as

    KL(F, \Phi\Theta) \to \min    (9)

Thus PLSA may be viewed as a stochastic matrix decomposition.
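A minimal Python/numpy sketch evaluating the weighted KL objective (7) for given factors, assuming a dense count matrix; the function name and eps safeguard are illustrative.

    import numpy as np

    def weighted_kl(n_wd, phi, theta, eps=1e-12):
        """Weighted KL divergence (7) between the empirical word distribution of
        each document and its reconstruction by Phi @ Theta.

        n_wd  : (W, D) count matrix, n_wd[w, d] = occurrences of word w in document d
        phi   : (W, T) matrix of p(w|t)
        theta : (T, D) matrix of p(t|d)
        """
        n_d = n_wd.sum(axis=0, keepdims=True)      # n_d, number of words per document
        p_hat = n_wd / np.maximum(n_d, 1)          # empirical p(w|d)
        q = phi @ theta                            # model p(w|d) = sum_t p(w|t) p(t|d)
        kl_per_doc = (p_hat * np.log((p_hat + eps) / (q + eps))).sum(axis=0)
        return float((n_d.ravel() * kl_per_doc).sum())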
1.4 Expectation-Maximization algorithm

Unfortunately, (2) has no analytical solution, so we use the Expectation-Maximization (EM) algorithm. This algorithm consists of two steps:

1. Estimation of the number n_dwt of occurrences of word w produced by topic t in document d (E-step).

2. Optimization of the distribution of documents over topics and of the distribution of topics over words, relying on the n_dwt values obtained during the E-step (M-step).

One can estimate n_dwt as follows:

    n_{dwt} = n_{dw} \frac{p(w|t) \, p(t|d)}{\sum_{t} p(w|t) \, p(t|d)}    (10)

where n_dw is the number of occurrences of word w in document d. Thus the probability p(w|t) may be estimated as

    p(w|t) = \frac{n_{wt}}{n_t} = \frac{\sum_{d} n_{dwt}}{\sum_{w} \sum_{d} n_{dwt}}    (11)

where n_wt is the number of occurrences of word w produced by topic t:

    n_{wt} = \sum_{d \in D} n_{dwt}    (12)

and n_t is the number of words produced by topic t:

    n_t = \sum_{w \in V} n_{wt}    (13)

Similarly for p(t|d):

    p(t|d) = \frac{n_{td}}{n_d}    (14)

where n_d is the number of words in document d and n_td is the estimated number of words in document d produced by topic t:

    n_{td} = \sum_{w \in V} n_{dwt}    (15)

As one can see, the asymptotic time of this algorithm is O(D × V × T × I), where D is the number of documents, V is the average number of distinct words per document, T is the number of topics and I is the number of iterations until convergence. Inference of PLSA on a large dataset therefore requires a lot of time, so methods for decreasing the computation time are important.

The number of topics and the number of documents are defined by the application. The size of the vocabulary (the number of distinct words) can be decreased by text normalization (removal of stop-words, lowercasing, etc.). The number of iterations until convergence depends on the initial approximation of the PLSA parameters Φ and Θ, so a good initial approximation can reduce the number of iterations until convergence. The current study presents an efficient approach to finding a beneficial initial approximation. The other method of computation time reduction is based on the idea that the matrix Φ may be obtained on a small representative part of the document collection.
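A minimal Python/numpy sketch of one EM iteration implementing the updates (10)-(15), assuming a dense word-document count matrix; a practical implementation would use sparse counts and avoid materializing n_dwt for the whole collection.

    import numpy as np

    def plsa_em_iteration(n_wd, phi, theta, eps=1e-12):
        """One EM iteration of PLSA following equations (10)-(15).

        n_wd  : (W, D) word-document count matrix
        phi   : (W, T) current estimate of p(w|t)
        theta : (T, D) current estimate of p(t|d)
        Returns the updated (phi, theta).
        """
        W, D = n_wd.shape
        T = phi.shape[1]
        n_wt = np.zeros((W, T))      # expected counts of word w produced by topic t
        n_td = np.zeros((T, D))      # expected counts of topic t in document d
        for d in range(D):
            # E-step: n_dwt = n_dw * p(w|t) p(t|d) / sum_t p(w|t) p(t|d), equation (10)
            joint = phi * theta[:, d]                                  # (W, T)
            resp = joint / np.maximum(joint.sum(axis=1, keepdims=True), eps)
            n_dwt = n_wd[:, [d]] * resp
            n_wt += n_dwt                                              # equation (12)
            n_td[:, d] = n_dwt.sum(axis=0)                             # equation (15)
        # M-step: equations (11) and (14)
        phi = n_wt / np.maximum(n_wt.sum(axis=0, keepdims=True), eps)
        theta = n_td / np.maximum(n_td.sum(axis=0, keepdims=True), eps)
        return phi, theta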
2 Related Work

The original algorithm was described in 1999 in [4]. Since then numerous papers have been devoted to PLSA, but only a few of them address time efficiency. In [5] the authors improve time efficiency by parallelizing the algorithm with OpenMP and report a 6x speed-up on an 8-CPU machine. Work [6] improves on that result by using MPI. Both of these studies, however, attack the time efficiency problem purely by programming methods.

In [7] Farahat uses LSA to find an initialization for PLSA. LSA is based on the SVD (singular value decomposition: a factorization of a matrix into the product of a unitary matrix, a diagonal matrix, and another unitary matrix) in the L2 norm and lacks a probabilistic interpretation. PLSA performs a stochastic matrix decomposition based on the Kullback-Leibler divergence and has a simple probabilistic interpretation, but it inherits the problem of every non-convex optimization algorithm: it may converge to a local minimum instead of the global one. A combination of LSA and PLSA leverages the best features of both models: using the LSA training result as an initial approximation helps to avoid convergence to a poor local minimum. However, the problem of time efficiency is not explored there. In [7] it is shown that the L2 norm is appropriate for finding an initialization for the PLSA inference algorithm; we will use this result in our work.

The idea of obtaining a distribution over topics for a document not included in the collection that PLSA was initially trained on was expressed in [8]. The author suggests performing this through the EM scheme while holding the matrix Φ fixed. However, he proposes this method only for query processing, not for PLSA training speed-up.

3 Proposed approach

In this work we present two different approaches for computation time reduction. One is based on finding an initial approximation and reducing the number of iterations until convergence. The other is based on obtaining Φ_part on a representative sample, fixing Φ_part and then obtaining Θ on the whole collection.

3.1 Finding an initial approximation

In this work we use neither LSA nor clustering methods. Instead we take a subset of our collection (for example 10%), apply PLSA to this sample and calculate an initial approximation using the obtained matrix Φ. The computation time of PLSA is proportional to the number of documents in the collection, so training PLSA on a 10% part of the collection is at least ten times faster (per iteration) than training on the whole collection.

3.1.1 Taking a sample

In order to obtain a representative sample we take a random one. The exact size of the sample is not important, so we use a rather simple scheme: we include each document independently with probability 10%. In this way we obtain a representative part of the collection whose size is approximately 10% of the whole collection.

3.1.2 Initial approximation of Φ (words-topics)

For training PLSA on the sample we use a random initialization. The computation time is linear in the number of documents in the collection, so this training turns out to be relatively fast. The obtained matrix Φ_part can be used as an initial approximation of the matrix Φ for the whole collection. An issue with this approach is that some words from the vocabulary do not occur in the sample, so every topic in the matrix Φ_part assigns zero weight to these words. If we used the matrix Φ_part as is, these probabilities would stay zero on every step (10), (2). This would have a disastrous effect on the likelihood (or perplexity) of our model:

    likelihood = \prod_{w \in d} \sum_{t \in T} p(w|t) \, \theta_{td} \equiv 0    (16)

because some word w* would have zero probability in every topic. Thus some kind of smoothing is necessary. In this work we use a trivial one (see the sketch after this list):

1. Add a constant to every position in every topic: ∀t, w: p(w|t) += const. In this work we use const = 1 / vocabularySize.

2. Normalize: ∀t \sum_{w} p(w|t) = 1.
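A minimal Python sketch of this smoothing; the constant 1/vocabularySize follows the text above, the function name is illustrative.

    import numpy as np

    def smooth_phi(phi_part):
        """Smooth a topic-word matrix trained on a sample (steps 1-2 above),
        so that no word has zero probability in any topic.

        phi_part : (W, T) matrix of p(w|t) estimated on the sample
        """
        vocabulary_size = phi_part.shape[0]
        phi = phi_part + 1.0 / vocabulary_size       # step 1: add const = 1 / vocabularySize
        return phi / phi.sum(axis=0, keepdims=True)  # step 2: renormalize every topic over words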
3.1.3 Initial approximation of Θ (documents-topics)

During the previous step we found an initial approximation for the matrix Φ (words by topics). Now we have to find an initial approximation for the matrix Θ (documents by topics) given Φ. Each of its columns Θ_{·d} can be found as a solution of the following optimization problem:

    \Theta_{\cdot d} = \arg\max_{\theta} P(d|\Theta) = \arg\max_{\theta} \prod_{w \in d} \sum_{t \in T} p(w|t) \, \theta_{td}    (17)

with the constraint

    \sum_{t \in T} \theta_{td} = 1    (18)

But our aim is to decrease the training time, and the computational method for solving this problem is not fast enough. We propose to find Θ_{·d} in the L2 norm. The L2 solution would not be a solution in our space with the Kullback-Leibler divergence, but we are not looking for the exact solution, only for an appropriate initialization.

Assume that all words are replaced by their serial numbers in the vocabulary. Let us consider topics and documents as vectors in R^V, where V stands for the vocabulary size. The i-th coordinate of a document vector is the number of times the word with number i occurs in the document:

    \vec{d}(i) = \#(\text{word } i \text{ occurs in document } d)

For a topic vector, the i-th coordinate is the probability of generating word i from this topic t:

    \vec{t}(i) = p(i|t)

Consider the vector space spanned by the topic vectors; the topics form a basis of this space. One can find an initial approximation for the distribution of a document d over topics as the orthogonal projection onto this space:

    \vec{d} = \vec{d}_{\perp} + \vec{d}_{\parallel}    (19)

where

    \forall \vec{t} \in T \;\; (\vec{d}_{\perp}, \vec{t}) = 0    (20)

For computational efficiency we perform orthogonalization and normalization of the basis \{\vec{t}\} \to \{\vec{t}\,'\}, where \forall i \ne j \; (\vec{t}\,'_i, \vec{t}\,'_j) = 0 and \forall i \; (\vec{t}\,'_i, \vec{t}\,'_i) = 1. This allows us to find the projection faster and more simply:

    \vec{d}_{\parallel} = \sum_{i} \alpha_i \vec{t}\,'_i    (21)

where \vec{t}\,'_i is the i-th vector of the orthonormal basis and

    (\vec{d}, \vec{t}\,'_i) = (\vec{d}_{\parallel}, \vec{t}\,'_i) = \Big( \vec{t}\,'_i, \sum_{j} \alpha_j \vec{t}\,'_j \Big) = \alpha_i    (22)

where the scalar product is defined as follows:

    (\vec{x}, \vec{y}) = \sum_{i=1}^{V} x_i y_i    (23)

Once the expansion of the document vector in the orthonormal basis is obtained, one can return to the topic basis and perform normalization:

    \sum_{t \in T} d_t = 1    (24)

where d_t is the weight of topic t in document d.

Due to the nature of the L2 norm some topic weights may be too small or even negative, so smoothing is necessary. We do it analogously to the previous subsection (a sketch of the whole initialization follows this list):

1. Replace negative weights by zero.

2. Add a constant to every weight; in this paper we use const = 1 / numberOfTopics.

3. Perform normalization: ∀d \sum_{t \in T} d_t = 1.
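A minimal Python sketch of this initialization, assuming dense vectors; it uses a least-squares solve in place of the explicit orthonormalization described above (the resulting projection onto the span of the topic vectors is the same), and the function name is illustrative.

    import numpy as np

    def init_theta_for_document(doc_counts, topic_vectors):
        """L2 initialization of a document's topic weights (Section 3.1.3).

        doc_counts    : (V,) vector of word counts of the document
        topic_vectors : (V, T) matrix whose columns are the topic vectors p(i|t)
        Returns a (T,) non-negative vector of topic weights summing to 1.
        """
        n_topics = topic_vectors.shape[1]
        # Orthogonal projection of the document vector onto the span of the topic
        # vectors, expressed directly in the topic basis via least squares.
        weights, *_ = np.linalg.lstsq(topic_vectors, doc_counts, rcond=None)
        weights = np.maximum(weights, 0.0)   # 1. replace negative weights by zero
        weights += 1.0 / n_topics            # 2. add const = 1 / numberOfTopics
        return weights / weights.sum()       # 3. normalize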
3.2 Fixed Φ approach

This approach is based on the fact that the matrix Φ may be found by training PLSA on a representative subset D_0 ⊂ D. The columns θ_{·d} of the matrix Θ may then be obtained through the following constrained optimization problem, solved independently for each document d ∈ D:

    L(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\phi}_{wt} \theta_{td} \to \max_{\theta_{\cdot d}}    (25)

    \sum_{t \in T} \theta_{td} = 1    (26)

    \theta_{td} \ge 0    (27)

This approach consists of the following steps:

1. Take a random subset of documents D_0 ⊂ D of sufficient size s.

2. Obtain Φ̂ by training PLSA on D_0 using the EM algorithm described in [4].

3. Obtain θ_{·d} by solving the optimization problem (25), (26), (27) for each document d ∈ D.

The third step needs some explanation. We solve this problem with an EM algorithm; a sketch is given at the end of this subsection.

E-step. Estimate the probabilities p(t|d, w) as

    p(t|d, w) = \frac{\hat{\phi}_{wt} \theta_{td}}{\sum_{\tau \in T} \hat{\phi}_{w\tau} \theta_{\tau d}}    (28)

M-step. In order to solve the problem (25), (26), (27), temporarily omit the non-negativity constraint (27); we will see that the solution is non-negative anyway. The Lagrange function for the problem (25), (26) takes the form

    L(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\phi}_{wt} \theta_{td} - \lambda \Big( \sum_{t \in T} \theta_{td} - 1 \Big)    (29)

Take the derivative:

    \frac{\partial L}{\partial \theta_{td}}(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \frac{\hat{\phi}_{wt}}{\sum_{\tau \in T} \hat{\phi}_{w\tau} \theta_{\tau d}} - \lambda = 0    (30)

Move λ to the right-hand side, multiply both sides by θ_td and use (28):

    \sum_{w \in W} n_{dw} \, p(t|d, w) = \lambda \theta_{td}    (31)

Now sum both sides over every t ∈ T:

    \sum_{w \in W} \sum_{t \in T} n_{dw} \, p(t|d, w) = \lambda    (32)

From equation (31) obtain θ_td and substitute λ from (32):

    \theta_{td} = \frac{\sum_{w \in W} n_{dw} \, p(t|d, w)}{\sum_{\omega \in W} \sum_{\tau \in T} n_{d\omega} \, p(\tau|d, \omega)}

The denominator is independent of t, thus

    \theta_{td} \propto \sum_{w \in W} n_{dw} \, p(t|d, w)    (33)

Primal feasibility is easily verified by summing θ_td over t. Dual feasibility follows from (32) and the non-negativity of probabilities. So this point satisfies the Karush-Kuhn-Tucker conditions. Also, one can easily see that θ_td ≥ 0, thus we have found a solution of the problem (25), (26), (27).
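A minimal Python/numpy sketch of this folding-in step for a single document, assuming Φ̂ has already been trained on the subset; the function name, iteration count and random start are illustrative.

    import numpy as np

    def infer_theta_fixed_phi(doc_counts, phi_hat, n_iter=50, eps=1e-12, seed=0):
        """Estimate theta_{.d} for one document with the topic-word matrix fixed,
        iterating the E-step (28) and the M-step (33).

        doc_counts : (W,) word counts n_dw of the document
        phi_hat    : (W, T) topic-word matrix trained on the subset D_0
        """
        rng = np.random.default_rng(seed)
        theta = rng.dirichlet(np.ones(phi_hat.shape[1]))   # random start on the simplex
        for _ in range(n_iter):
            # E-step (28): responsibilities p(t|d,w) for every word
            joint = phi_hat * theta                        # (W, T)
            resp = joint / np.maximum(joint.sum(axis=1, keepdims=True), eps)
            # M-step (33): theta_td proportional to sum_w n_dw p(t|d,w)
            theta = doc_counts @ resp
            theta /= max(theta.sum(), eps)
        return theta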
4 Experiments

We conduct two kinds of experiments: we evaluate perplexity [9] for our approaches and for classical PLSA, and we compare classification performance on the topic distributions obtained by PLSA and by our approaches.

4.1 Datasets

Both experiments were conducted on three datasets: tweets, news articles, and abstracts of scientific papers.

• Twitter dataset. The Twitter dataset contains tweets posted by 15000 Twitter users, written in English. We merge all tweets posted by a single user into a single document, so every document contains approximately 1000 tweets. Documents with fewer than 50 words are omitted.

• The 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/). The 20 Newsgroups dataset is often used for testing topic models on text categorization. It contains short news articles, each belonging to one of twenty newsgroups. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across the 20 newsgroups.

• arXiv (http://arxiv.org/). The third dataset consists of abstracts of scientific articles. It contains approximately 900000 abstracts from 6 areas: Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The distribution of articles over areas is not uniform: Physics contains 600 thousand, Mathematics contains 270 thousand, and Quantitative Biology only 5 thousand. Abstracts with fewer than 20 words are omitted. For the experiments with fixed Φ we omit the small areas and take only Physics, Mathematics and Computer Science.

Some text normalization is performed: stoplisting, lowercasing, and removal of rare words (ones that occur less than 5 times in the whole collection).

4.2 Initial approximation

Four types of initial approximation are compared:

• Random initial approximation for the matrix Φ (words by topics) and the matrix Θ (documents by topics). Denoted "randomly".

• Calculate an initial approximation for the matrix Φ on a sample; use a random initial approximation for the matrix Θ. Denoted "phi".

• Calculate an initial approximation for the matrix Θ on a sample; use a random initial approximation for the matrix Φ. Denoted "theta".

• Calculate initial approximations for both the matrix Θ and the matrix Φ on a sample. Denoted "full".

4.2.1 Perplexity depending on initial approximation

We evaluate the dependence of perplexity on the different types of initial approximation. During these experiments the number of iterations is fixed and equals 100. The number of topics is 25 for every experiment. In Figures 1, 2 and 3 one can see perplexity depending on the number of iterations for the different datasets and the different types of initial approximation. The keys are the ones introduced at the beginning of Section 4.2.

[Figure 1: Perplexity for the Twitter dataset]
[Figure 2: Perplexity for news articles]
[Figure 3: Perplexity for scientific articles]

As one can see, all types of initial approximation decrease the perplexity of the model. A model initialized by our approach converges faster than a model with a random initial approximation. The same behavior is observed for every dataset. The perplexity values obtained after 100 iterations for the different datasets and types of initialization are presented in Table 1.

Table 1: Perplexity after 100 iterations

    initialization   news articles   tweets    arxiv
    random           3006.86         9287.68   1575.58
    Θ only           2913.73         9111.91   1551.84
    Φ only           2993.90         9143.94   1559.13
    Θ and Φ          2973.90         8816.67   1561.07
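The paper does not spell out the perplexity formula; the sketch below assumes the standard definition for topic models, the exponential of the negative log-likelihood per word, with illustrative names.

    import numpy as np

    def perplexity(n_wd, phi, theta, eps=1e-12):
        """Corpus perplexity: exp(- sum_{d,w} n_dw log p(w|d) / sum_{d,w} n_dw).

        n_wd  : (W, D) word-document count matrix
        phi   : (W, T) topic-word matrix p(w|t)
        theta : (T, D) document-topic matrix p(t|d)
        """
        p_wd = phi @ theta                                   # model probability p(w|d)
        log_likelihood = (n_wd * np.log(p_wd + eps)).sum()
        return float(np.exp(-log_likelihood / n_wd.sum()))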
4.2.2 Computation time depending on initial approximation

In these experiments we evaluate training time depending on the initial approximation. We perform iterations until the stop criterion is satisfied: the change of perplexity is less than 1 five times in a row. Choosing a threshold is not the aim of this work; similar results were observed for a wide range of thresholds, and the difference in dispersion is less than the standard deviation. The results for the different datasets and types of initialization are presented in Tables 2, 3 and 4. They include the whole training time: training on the sample, orthogonalization, finding the initial approximation for Θ, and training on the whole collection. A sketch of the stop criterion is given at the end of this subsection.

Table 2: Computation time depending on initial approximation for the Twitter dataset

    initialization   perplexity   time, sec   #iterations
    random           9258.1       7164.9      114
    Θ only           9143.5       4131.9      57
    Φ only           9213.2       4814.9      67
    Θ and Φ          8856.5       3722.6      48

Table 3: Computation time depending on initial approximation for news articles

    initialization   perplexity   time, sec   #iterations
    random           2996.1       161.5       110
    Θ                2946.9       102.3       64
    Φ                3020.6       97.5        61
    Θ and Φ          2999.4       106.6       67

Table 4: Computation time depending on initial approximation for arXiv

    initialization   perplexity   time, sec   #iterations
    random           1592.8       4017.8      59
    Θ                1560.9       2413.7      30
    Φ                1587.7       2012.9      25
    Θ and Φ          1583.7       1893.3      21

It can be seen that our approach decreases the calculation time 1.5-2 times on every dataset. The perplexity of the models with an initial approximation obtained by our approach is less than or equal to the perplexity of the model with a random initial distribution.
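One possible reading of the stop criterion described above ("the change of perplexity is less than 1 five times in a row"), as a minimal Python sketch with the threshold and patience as assumed parameters:

    def has_converged(perplexity_history, threshold=1.0, patience=5):
        """True when the absolute change of perplexity has stayed below `threshold`
        for the last `patience` consecutive iterations."""
        if len(perplexity_history) <= patience:
            return False
        recent = perplexity_history[-(patience + 1):]
        changes = [abs(a - b) for a, b in zip(recent[:-1], recent[1:])]
        return all(c < threshold for c in changes)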
4.3 Fixed Φ approach

4.3.1 Perplexity

Perplexity inspection is a common way to compare different topic models [9], [10], [11]. We compare our model, with varying sizes of the training subset, against PLSA on the different datasets. One can see the results in Figures 4, 5 and 6.

[Figure 4: Perplexity depending on training subset size for the arXiv dataset]
[Figure 5: Perplexity depending on training subset size for the Twitter dataset]
[Figure 6: Perplexity depending on training subset size for the news articles dataset]

As one can see, the perplexity values for PLSA and for our approximation are nearly equal, especially for such large collections as the arXiv or Twitter datasets. It is worth mentioning that all the perplexity curves have a "horizontal tail": the Φ matrix is not inferred better on a larger sample. This means that the size of the sample needed to infer the Φ matrix does not depend on the size of the dataset, which is especially significant for training on huge datasets.

4.3.2 Text categorization

The other way to compare different topic models is an application task, for example document categorization. In these experiments we classify news articles by category and Twitter users by gender, treating topic distributions as features. We obtain topic distributions by training PLSA on the whole collection and by our approximation with topics trained on a varying fraction of the whole collection. Then we estimate the quality of classification by cross-validation with 10 folds. In both experiments we use 20 topics. For classification we use the random forest classifier from the scikit-learn package (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html); a sketch of this evaluation is given at the end of this subsection. The results can be found in Figures 7 and 8. The majority-to-minority class ratio for the Twitter dataset is 1.17.

[Figure 7: Classification of Twitter users by gender]
[Figure 8: Classification of news articles by category]

As one can see, our approximation exhibits comparable results if the training dataset is not too small, and it works noticeably better for a large collection. Another important observation is that in all the experiments the curve reaches a plateau; this confirms that the matrix Φ may be found by training PLSA on a representative subset.
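A minimal scikit-learn sketch of this evaluation; the function name, number of trees and the default accuracy scoring are assumptions, since the paper does not state which score is plotted.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate_topic_features(theta, labels):
        """10-fold cross-validation of a random forest on document-topic features.

        theta  : (D, T) array, each row is a document's distribution over topics
        labels : (D,) array of class labels (news category or Twitter user gender)
        """
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores = cross_val_score(clf, theta, labels, cv=10)
        return scores.mean(), scores.std()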
4.3.3 Time efficiency

One of the aims of our work is time efficiency improvement. We compare the training time of PLSA and of our approximation with |D_0| = (1/4)|D|. Table 5 presents the calculation time for PLSA and for our approximation (the time to train PLSA on the subset is included).

Table 5: Time to process the collection

    Dataset         PLSA, sec   approx., sec   speed-up, times
    Arxiv           4333        1972           2.2
    Twitter         6384        2702           2.4
    News articles   164         64             2.6

Another important characteristic is the average time to process one document with our approximation and with PLSA. The obtained values and the speed-up in comparison with the average time needed to train the original PLSA on a single document are given in Table 6 (the time to train PLSA on the subset is not included).

Table 6: Average time to process one document

    Dataset         PLSA, ms   approx., ms   speed-up, times
    Arxiv           14.3       3.0           4.8
    Twitter         414.2      70.7          5.9
    News articles   9.0        0.7           12.9

5 Conclusion

We developed two methods for computation time reduction, one based on finding an appropriate initial approximation and the other based on fixing the Φ matrix, and tested these methods on three different datasets. The method based on finding an initial approximation demonstrates the same behavior on every dataset used: the calculation time and the number of iterations until convergence are decreased, yet the quality of the topic model does not decrease. We confirm that the transition from the Kullback-Leibler divergence to the L2 norm is appropriate for finding an initial approximation for PLSA.

The method based on fixing the Φ matrix demonstrates a more significant speed-up, but precision drops. However, the drop in precision is not significant, especially on large datasets.

References

[1] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Mach. Learn., 88(1-2):157–208, July 2012.

[2] Cailing Dong and Bin Zhou. Effectively detecting content spam on the web using topical diversity measures. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:266–273, 2012.

[3] Venkatesh Ramanathan and Harry Wechsler. PhishGILLNET: phishing detection methodology using probabilistic latent semantic analysis, AdaBoost, and co-training. EURASIP Journal on Information Security, 2012(1):1, 2012.

[4] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[5] Chuntao Hong, Yurong Chen, Weimin Zheng, Jiulong Shan, Yurong Chen, and Yimin Zhang. Parallelization and characterization of probabilistic latent semantic analysis. In Parallel Processing, 2008. ICPP '08. 37th International Conference on, pages 628–635, 2008.

[6] Raymond Wan, Vo Ngoc Anh, and Hiroshi Mamitsuka. Efficient probabilistic latent semantic analysis through parallelization. In Gary Geunbae Lee, Dawei Song, Chin-Yew Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and Tetsuya Sakai, editors, Information Retrieval Technology, volume 5839 of Lecture Notes in Computer Science, pages 432–443. Springer Berlin Heidelberg, 2009.

[7] Ayman Farahat. Improving probabilistic latent semantic analysis using principal component analysis. In Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2006.

[8] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[9] Anna Potapenko and Konstantin Vorontsov. Robust PLSA performs better than LDA. In ECIR, pages 784–787, 2013.

[10] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[11] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1105–1112, New York, NY, USA, 2009. ACM.