
Proceedings of the Tenth Spring Researcher's Colloquium on Databases and Information Systems, Veliky Novgorod, Russia, 2014

The Institute for System Programming (ISP) of the Russian Academy of Sciences

Probabilistic Latent Semantic Analysis (PLSA) is an effective technique for information retrieval, but it has a serious drawback: it consumes a huge amount of computational resources, so it is hard to train this model on a large collection of documents. The aim of this paper is to improve the time efficiency of the training algorithm. Two different approaches are explored: the first is based on efficiently finding an appropriate initial approximation; the second relies on the idea that, for most collections, topics may be extracted from a relatively small fraction of the data.

1 Introduction

Topic modeling is an application of machine learning to text analysis. It is useful for various text analysis tasks, for example document categorization [1], spam detection [2], phishing detection [3], and many others.

One of the widespread algorithms is Probabilistic Latent Semantic Analysis (PLSA), suggested by Thomas Hofmann in [4].

1.1 Generative model

PLSA is based on the "bag of words" generative model: every document is assumed to be a multinomial distribution over topics, and every topic is a multinomial distribution over words. The generative process may be defined as follows:

• For every position in document d, independently choose a topic t from the distribution of topics for that document.
• Choose a word w from topic t.

The aim of topic modeling is to recover the topics and the distribution of each document over topics.
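The generative process above can be sketched in a few lines of Python. This is only an illustration under assumed toy sizes: the names `generate_document`, `phi` (word-by-topic probabilities) and `theta_d` (topic weights of one document) are hypothetical and do not come from the paper.

```python
import numpy as np

def generate_document(phi, theta_d, length, rng):
    """Sample word ids for one document.

    phi     : V x T array, column t holds p(w|t)
    theta_d : length-T array holding p(t|d)
    """
    num_words, num_topics = phi.shape
    # Step 1: for every position, draw a topic from the document's topic distribution.
    topics = rng.choice(num_topics, size=length, p=theta_d)
    # Step 2: for every position, draw a word from the chosen topic.
    return np.array([rng.choice(num_words, p=phi[:, t]) for t in topics])

rng = np.random.default_rng(0)
V, T = 6, 2                                  # toy vocabulary size and topic count (assumed)
phi = rng.dirichlet(np.ones(V), size=T).T    # V x T, every column sums to 1
theta_d = np.array([0.7, 0.3])
doc = generate_document(phi, theta_d, 20, rng)
```

Word order carries no information in this model, which is why the result can be summarized by word counts alone.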

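1.2 Topic modeling as optimization problem

According to the generative model, the probability of observing collection D can be estimated as

$$p(D) = \prod_{d \in D} \prod_{w \in d} \sum_{t} p(t|d)\, p(w|t). \tag{1}$$

Denote $\varphi_{wt} = p(w|t)$ and $\theta_{td} = p(t|d)$. One may obtain $\varphi_{wt}$ and $\theta_{td}$ as the solution of optimization problem (2) with constraints (3) and (4):

$$L = \sum_{d \in D} \sum_{w \in d} \log \sum_{t} \varphi_{wt}\theta_{td} \to \max_{\Phi, \Theta} \tag{2}$$

$$\forall t:\ \sum_{w} \varphi_{wt} = 1; \qquad \forall d:\ \sum_{t} \theta_{td} = 1 \tag{3}$$

$$\forall t, w:\ \varphi_{wt} \ge 0; \qquad \forall d, t:\ \theta_{td} \ge 0 \tag{4}$$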
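As a small illustration of objective (2), the log-likelihood can be evaluated from a word-by-document count matrix, because summing over word positions in a document is the same as summing over distinct words weighted by their counts. The function name and the toy matrices below are assumptions made for this sketch.

```python
import numpy as np

def log_likelihood(n_wd, phi, theta):
    """Objective (2) computed from counts.

    n_wd  : V x D array of word counts per document
    phi   : V x T array, phi[w, t] = p(w|t)
    theta : T x D array, theta[t, d] = p(t|d)
    """
    p_wd = phi @ theta          # V x D, model probability of word w in document d
    mask = n_wd > 0             # words that never occur contribute nothing
    return float(np.sum(n_wd[mask] * np.log(p_wd[mask])))

# Toy example: 3 words, 2 topics, 2 documents, completely uniform model.
n_wd = np.array([[2.0, 0.0],
                 [1.0, 1.0],
                 [0.0, 3.0]])
phi = np.full((3, 2), 1.0 / 3.0)
theta = np.full((2, 2), 0.5)
value = log_likelihood(n_wd, phi, theta)
```

With the uniform model every word has probability 1/3, so the value is simply the total number of words times ln(1/3).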

1.3 Topic modeling as matrix decomposition

1.3.1 Kullback-Leibler divergence

Kullback-Leibler divergence is a non-negative measure of the difference between two probability distributions:

$$KL(p_i \| q_i) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}. \tag{5}$$

Consider an empirical distribution $\hat{p}_i$ and some parametric distribution $q_i = q_i(\alpha)$ which is used to explain $\hat{p}_i$. It is easy to see that in this case minimization of the KL-divergence is equivalent to estimating $\alpha$ by maximum likelihood:

$$KL(p_i \| q_i(\alpha)) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i(\alpha)} \to \min_{\alpha} \iff \sum_{i=1}^{n} p_i \ln q_i(\alpha) \to \max_{\alpha}. \tag{6}$$
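The equivalence can be seen by splitting the logarithm. The short derivation below is a standard identity rather than something taken from the paper; the symbol $\alpha$ for the parameter of $q$ is an assumed name.

```latex
\begin{aligned}
KL\bigl(p \,\|\, q(\alpha)\bigr)
  &= \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i(\alpha)} \\
  &= \underbrace{\sum_{i=1}^{n} p_i \ln p_i}_{\text{does not depend on } \alpha}
     \;-\; \sum_{i=1}^{n} p_i \ln q_i(\alpha)
\end{aligned}
```

Since the first sum is a constant with respect to $\alpha$, minimizing the divergence over $\alpha$ is the same as maximizing $\sum_i p_i \ln q_i(\alpha)$, which is the log-likelihood up to a positive factor.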

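Thus one can easily see that (2) is equivalent to minimizing a weighted Kullback-Leibler divergence:

$$\sum_{d \in D} n_d\, KL_w\!\left( \frac{n_{dw}}{n_d} \,\Big\|\, \sum_{t \in T} \varphi_{wt}\theta_{td} \right) \to \min_{\Phi, \Theta}, \tag{7}$$

where $n_{dw}$ is the number of occurrences of word w in document d and $n_d$ is the number of words in document d.

1.3.2 Matrix decomposition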

Denote the empirical distribution of words in a document as $\hat{p}(w, d) = \frac{n_{dw}}{n_d}$. With this notation one can consider problem (2) as a matrix decomposition:

$$F \approx \Phi\Theta, \tag{8}$$

where matrix $F = (\hat{p}(w, d))_{W \times D}$ is the empirical distribution of words by document, matrix $\Phi = (\varphi_{wt})_{W \times T}$ is the distribution of words by topics, and matrix $\Theta = (\theta_{td})_{T \times D}$ is the distribution of topics by documents. Thus, our optimization problem may be rewritten in Kullback-Leibler notation as

$$KL(F, \Phi\Theta) \to \min. \tag{9}$$

Thus PLSA may be viewed as a stochastic matrix decomposition.
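The link between the weighted divergence (7) and the log-likelihood (2) can be checked numerically: their sum should not depend on the model matrices at all. The sketch below uses random toy data; all sizes and helper names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, D = 8, 3, 5  # toy sizes (assumed)

# Strictly positive toy counts keep every logarithm finite.
n_wd = rng.integers(1, 6, size=(V, D)).astype(float)
n_d = n_wd.sum(axis=0)
F = n_wd / n_d                      # empirical distribution, columns sum to 1

def random_stochastic(rows, cols):
    """Random matrix with positive entries whose columns sum to 1."""
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=0)

def weighted_kl(phi, theta):
    """Left-hand side of (7): sum over documents of n_d * KL(F[:, d] || (phi theta)[:, d])."""
    Q = phi @ theta
    return float(np.sum(n_d * np.sum(F * np.log(F / Q), axis=0)))

def log_likelihood(phi, theta):
    """Objective (2) written with counts."""
    return float(np.sum(n_wd * np.log(phi @ theta)))

phi_a, theta_a = random_stochastic(V, T), random_stochastic(T, D)
phi_b, theta_b = random_stochastic(V, T), random_stochastic(T, D)

# The sum depends only on the data, not on the model.
const_a = weighted_kl(phi_a, theta_a) + log_likelihood(phi_a, theta_a)
const_b = weighted_kl(phi_b, theta_b) + log_likelihood(phi_b, theta_b)
```

Because the two quantities differ only by a data-dependent constant, decreasing one is the same as increasing the other.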

1.4 Expectation-Maximization algorithm

Unfortunately, (2) has no analytical solution, so we use the Expectation-Maximization (EM) algorithm. This algorithm consists of two steps:

1. E-step: estimation of the number $n_{dwt}$ of occurrences of word w produced by topic t in document d.
2. M-step: optimization of the distribution of documents by topics and of the distribution of topics by words, relying on the $n_{dwt}$ values obtained during the E-step.

One can estimate $n_{dwt}$ as follows:

$$n_{dwt} = n_{dw} \frac{p(w|t)\,p(t|d)}{\sum_{\tau} p(w|\tau)\,p(\tau|d)}, \tag{10}$$

where $n_{dw}$ is the number of occurrences of word w in document d. Thus, the probability $p(w|t)$ may be estimated as

$$p(w|t) = \frac{n_{wt}}{n_t} = \frac{\sum_{d} n_{dwt}}{\sum_{w} \sum_{d} n_{dwt}}, \tag{11}$$

where $n_{wt}$ is the number of occurrences of word w produced by topic t,

$$n_{wt} = \sum_{d \in D} n_{dwt}, \tag{12}$$

and $n_t$ is the number of words produced by topic t,

$$n_t = \sum_{w \in V} n_{wt}. \tag{13}$$

Similarly for $p(t|d)$:

$$p(t|d) = \frac{n_{td}}{n_d}, \tag{14}$$

where $n_d$ is the number of words in document d and $n_{td}$ is the estimated number of words in document d produced by topic t:

$$n_{td} = \sum_{w \in V} n_{dwt}. \tag{15}$$

As one can see, the asymptotic running time of this algorithm is $O(D \cdot V \cdot T \cdot I)$, where D is the number of documents, V is the average number of distinct words in a document, T is the number of topics and I is the number of iterations until convergence. Inference of PLSA on a large dataset therefore takes a long time, so methods for reducing the computation time are important.
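The E-step and M-step above can be sketched with dense NumPy arrays. This is a minimal illustration, not the authors' implementation: it stores the full V x T x D array of expected counts for readability, whereas a practical version would loop only over the non-zero counts to reach the stated complexity. The function names and toy data are assumptions.

```python
import numpy as np

def log_likelihood(n_wd, phi, theta):
    p_wd = phi @ theta
    mask = n_wd > 0
    return float(np.sum(n_wd[mask] * np.log(p_wd[mask])))

def plsa_em(n_wd, num_topics, num_iters=50, seed=0):
    """Dense EM for PLSA. n_wd is a V x D count matrix with no empty documents."""
    rng = np.random.default_rng(seed)
    V, D = n_wd.shape
    phi = rng.random((V, num_topics))
    phi /= phi.sum(axis=0, keepdims=True)        # columns are p(w|t)
    theta = rng.random((num_topics, D))
    theta /= theta.sum(axis=0, keepdims=True)    # columns are p(t|d)
    n_d = n_wd.sum(axis=0, keepdims=True)
    for _ in range(num_iters):
        # E-step, eq. (10): expected counts n_dwt, stored with axes [w, t, d].
        joint = phi[:, :, None] * theta[None, :, :]
        denom = np.maximum(joint.sum(axis=1, keepdims=True), 1e-300)
        n_wtd = n_wd[:, None, :] * joint / denom
        # M-step, eqs. (11)-(15).
        n_wt = n_wtd.sum(axis=2)                  # eq. (12)
        n_td = n_wtd.sum(axis=0)                  # eq. (15)
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)   # eqs. (11), (13)
        theta = n_td / n_d                              # eq. (14)
    return phi, theta

# Toy usage on random counts.
toy_rng = np.random.default_rng(3)
counts = toy_rng.integers(0, 5, size=(12, 6)).astype(float) + 1.0
phi_1, theta_1 = plsa_em(counts, num_topics=3, num_iters=1)
phi_30, theta_30 = plsa_em(counts, num_topics=3, num_iters=30)
```

Because both runs start from the same seed, the longer run continues the shorter one, and its log-likelihood should not be lower.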

The number of topics and the number of documents are defined by the application. The size of the vocabulary (the number of distinct words) can be decreased by text normalization (removal of stop-words, lowercasing, etc.). The number of iterations until convergence depends on the initial approximation of the PLSA parameters $\Phi$ and $\Theta$, so a good initial approximation can reduce the number of iterations until convergence. The current study presents an efficient approach to finding a beneficial initial approximation. The other method of reducing computation time is based on the idea that matrix $\Phi$ may be obtained from a small representative part of the document collection.

2 Related Work

The original algorithm was described in 1999 in [4]. Since then numerous papers have been devoted to PLSA, but only a few of them address time efficiency. In [5] the authors improve time efficiency by parallelizing the algorithm with OpenMP and report a 6-fold speed-up on an 8-CPU machine. Work [6] improves on this result by using MPI. However, both of these studies address the problem of time efficiency purely by programming methods.

In [7] Farahat uses LSA to find an initialization for PLSA. LSA is based on the SVD¹ matrix decomposition in the L2 norm and lacks a probabilistic interpretation. PLSA performs a stochastic matrix decomposition based on Kullback-Leibler divergence and has a simple probabilistic interpretation, but it inherits the problem of every non-convex optimization algorithm: it may converge to a local minimum instead of the global one. The combination of LSA and PLSA leverages the best features of these models: using the LSA training result as an initial approximation helps to avoid convergence to a poor local minimum. However, the problem of time efficiency is not explored there. In [7] it is shown that the L2 norm is appropriate for finding an initialization for the PLSA inference algorithm; we use this result in our work.

The idea of obtaining a distribution over topics for a document not included in the collection that PLSA was initially trained on was expressed in [8]. The author suggests doing this through an EM scheme with matrix $\Phi$ held fixed. However, he proposes this method only for query processing, not for speeding up PLSA training.

3 Proposed approach

In this work we present two different approaches to reducing computation time. The first is based on finding an initial approximation that reduces the number of iterations needed for convergence. The second is based on obtaining $\Phi$ from a representative sample, fixing $\Phi$, and then obtaining $\Theta$ on the whole collection.

3.1 Finding initial approximation

In this work we use neither LSA nor clustering methods. Instead we take a subset of our collection (for example 10%), apply PLSA to this sample and calculate an initial approximation using the obtained matrix $\Phi$. The computation time of PLSA is proportional to the number of documents in the collection, so training PLSA on a 10% part of the collection is at least ten times faster per iteration than training on the whole collection.

¹ SVD: singular value decomposition, a factorization of a matrix into the product of a unitary matrix, a diagonal matrix, and another unitary matrix.

3.1.1 Taking a sample

To obtain a representative sample we take a random sample. The exact size of the sample is not important, so we use a rather simple scheme: each document is included independently with probability 10%. The resulting sample is a representative part of the collection, and its size is approximately 10% of the size of the whole collection.
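The sampling scheme described above amounts to one independent coin flip per document. A minimal sketch, with a hypothetical function name and an assumed collection size of 10,000 documents:

```python
import numpy as np

def bernoulli_sample(num_docs, prob=0.1, seed=0):
    """Return the sorted indices of documents kept, each included independently with probability prob."""
    rng = np.random.default_rng(seed)
    return np.flatnonzero(rng.random(num_docs) < prob)

sample_ids = bernoulli_sample(10_000)
```

The sample size is random, but for a large collection it concentrates tightly around 10% of the documents.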

3.1.2 Initial approximation of $\Phi$ (words-topics)

For training PLSA on the sample we use a random initialization. Computation time is linear in the number of documents in the collection, so training on the sample turns out to be relatively fast. The obtained matrix $\Phi_{part}$ can be used as an initial approximation of matrix $\Phi$ for the whole collection. An issue with this approach is that some words from the vocabulary do not occur in the sample, so every topic in matrix $\Phi_{part}$ has zero weight for these words. If we used matrix $\Phi_{part}$ as is, these probabilities would stay zero on every EM iteration: by (10) such words receive zero expected counts, and the M-step then re-estimates their probabilities as zero. This would have a disastrous effect on the likelihood (or perplexity) of our model:

$$\text{likelihood} = \prod_{d \in D} \prod_{w \in d} \sum_{t} p(w|t)\, p(t|d) = 0, \tag{16}$$

because some word w would have zero probability for every topic. Thus some kind of smoothing is necessary. In this work we use a trivial one:

1. Add a constant to every position in every topic:

$$\forall t, w:\quad p(w|t) \mathrel{+}= const.$$

In this work we use $const = \frac{1}{vocabularySize}$.

2. Normalize so that

$$\forall t:\quad \sum_{w} p(w|t) = 1.$$
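The two smoothing steps can be written directly on the word-by-topic matrix. The sketch below assumes a V x T array whose columns are topics; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def smooth_phi(phi_part):
    """Add 1/V to every entry of a V x T topic matrix, then renormalize each topic (column)."""
    vocabulary_size = phi_part.shape[0]
    phi = phi_part + 1.0 / vocabulary_size
    return phi / phi.sum(axis=0, keepdims=True)

# Word 2 never occurred in the sample, so both topics give it zero probability.
phi_part = np.array([[0.5, 0.0],
                     [0.5, 1.0],
                     [0.0, 0.0]])
phi_init = smooth_phi(phi_part)
```

After smoothing every entry is strictly positive, so no word is locked at zero probability during later EM iterations.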

3.1.3 Initial approximation of $\Theta$ (document-topics)

During the previous step we found an initial approximation for matrix $\Phi$ (words by topic). Now we have to find an initial approximation for matrix $\Theta$ (documents-topics) given $\Phi$. Each of its columns $\theta_d$ can be found as the solution of the following optimization problem:

$$\theta_d = \arg\max_{\theta_d} P(d|\Phi) = \arg\max_{\theta_d} \prod_{w \in d} \sum_{t} p(w|t)\, \theta_{td}.$$

But our aim is to decrease the training time, and solving this problem numerically is not fast enough. We propose to find $\theta_d$ in the L2 norm instead. The solution in L2 is not a solution in our space with Kullback-Leibler divergence, but we are looking not for the exact solution, only for an appropriate initialization.

Assume that all words are replaced by their serial numbers in the vocabulary. Let us consider topics and documents as vectors in $\mathbb{R}^V$, where V stands for the vocabulary size. The i-th coordinate of a document vector is the number of times word i occurs in the document:

$$\vec{d}(i) = \#(\text{word } i \text{ occurs in document } d).$$

For a topic vector, the i-th coordinate is the probability of generating word i from topic t:

$$\vec{t}(i) = p(i|t).$$

Consider the vector space spanned by the topic vectors; the topics form a basis of this space. One can find an initial approximation for the distribution of document d over topics as the orthogonal projection onto this space:

$$\vec{d} = \vec{d}_{\perp} + \vec{d}_{\parallel}, \quad \text{where } \forall t:\ (\vec{d}_{\perp}, \vec{t}) = 0.$$

For computational efficiency we orthogonalize and normalize the basis, $\{\vec{t}\} \to \{\vec{t}'\}$, where $\forall i \ne j:\ (\vec{t}'_i, \vec{t}'_j) = 0$ and $\forall i:\ (\vec{t}'_i, \vec{t}'_i) = 1$. This allows us to find the projection faster and more simply:

$$\vec{d}_{\parallel} = \sum_{i} \beta_i \vec{t}'_i,$$

where $\vec{t}'_i$ is the i-th vector of the orthonormal basis, and

$$(\vec{d}, \vec{t}'_i) = (\vec{d}_{\parallel}, \vec{t}'_i) = \Bigl(\vec{t}'_i, \sum_{j} \beta_j \vec{t}'_j\Bigr) = \beta_i,$$

where the scalar product is defined as

$$(\vec{x}, \vec{y}) = \sum_{i=1}^{V} x_i y_i.$$

Once the expansion of the document vector in the orthonormal basis is obtained, one can return to the topic basis and normalize so that

$$\sum_{t \in T} \theta_{td} = 1,$$

where $\theta_{td}$ is the weight of topic t in document d.
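One possible way to carry out this projection is with a QR decomposition, which orthonormalizes the topic vectors and also gives the change of basis back to the topics. This is a sketch under assumptions, not the authors' code: it assumes the topic vectors are linearly independent, uses NumPy's `numpy.linalg.qr` in place of an explicit Gram-Schmidt procedure, and applies the smoothing of negative weights that the text describes for this step. The order "clip, normalize, add constant, normalize" is a choice of this sketch.

```python
import numpy as np

def init_theta_by_projection(n_wd, phi):
    """L2 initial approximation of theta.

    n_wd : V x D count matrix (columns are document vectors)
    phi  : V x T matrix whose columns are topic vectors, assumed linearly independent
    """
    num_topics = phi.shape[1]
    # Orthonormal basis of the topic space: phi = Q R, Q has orthonormal columns.
    Q, R = np.linalg.qr(phi)
    # Coefficients of every document in the orthonormal basis (scalar products).
    beta = Q.T @ n_wd
    # Return to the topic basis: phi @ theta = Q @ beta  <=>  R @ theta = beta.
    theta = np.linalg.solve(R, beta)
    # Smoothing: drop negative weights, normalize, add 1/T, normalize again.
    theta = np.maximum(theta, 0.0)
    sums = theta.sum(axis=0, keepdims=True)
    theta = np.where(sums > 0, theta / np.where(sums > 0, sums, 1.0), 1.0 / num_topics)
    theta = theta + 1.0 / num_topics
    return theta / theta.sum(axis=0, keepdims=True)

# Toy check: documents built from a single topic should favor that topic.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(10), size=3).T          # 10 words, 3 topics
docs = np.column_stack([10 * phi[:, 0], 10 * phi[:, 2]])
theta0 = init_theta_by_projection(docs, phi)
```

The result is only a starting point for EM; it is not expected to maximize the likelihood.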

Due to the nature of the L2 norm some topic weights may be too small or even negative, so smoothing is necessary. We do it analogously to the previous subsection:

1. Replace negative weights by zero.
2. Add a constant to every weight. In this paper we use $const = \frac{1}{numberOfTopics}$.
3. Normalize so that $\forall d:\ \sum_{t} \theta_{td} = 1$.

Our second approach is based on the fact that matrix $\Phi$ may be found by training PLSA on a representative subset $D' \subset D$. The columns of matrix $\Theta$ may then be obtained by solving the following constrained optimization problem independently for each document $d \in D$:

$$L(\theta_d) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\varphi}_{wt}\theta_{td} \to \max_{\theta_d} \tag{25}$$

$$\sum_{t \in T} \theta_{td} = 1 \tag{26}$$

$$\theta_{td} \ge 0 \tag{27}$$

This approach consists of the following steps:

1. Take a random subset of documents $D' \subset D$ of sufficient size s.
2. Obtain $\hat{\Phi}$ by training PLSA on $D'$ using the EM algorithm described in [4].
3. Obtain $\theta_d$ by solving optimization problem (25), (26), (27) for each document $d \in D$.

The third step needs some explanation.

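We solve this problem with the EM algorithm.

E-step: estimate the probabilities $p(t|d, w)$ as

$$p(t|d, w) = \frac{\hat{\varphi}_{wt}\theta_{td}}{\sum_{\tau \in T} \hat{\varphi}_{w\tau}\theta_{\tau d}}. \tag{28}$$

M-step: in order to solve problem (25), (26), (27), temporarily omit the non-negativity constraint (27); we will see that the solution is non-negative.

The Lagrange function for problem (25), (26) takes the form

$$\mathcal{L}(\theta_d) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\varphi}_{wt}\theta_{td} - \lambda\Bigl(\sum_{t \in T} \theta_{td} - 1\Bigr). \tag{29}$$

Take the derivative with respect to $\theta_{td}$:

$$\sum_{w \in W} n_{dw} \frac{\hat{\varphi}_{wt}}{\sum_{\tau \in T} \hat{\varphi}_{w\tau}\theta_{\tau d}} - \lambda = 0. \tag{30}$$

Move $\lambda$ to the right-hand side, multiply both sides by $\theta_{td}$ and apply (28):

$$\sum_{w \in W} n_{dw}\, p(t|d, w) = \lambda\theta_{td}. \tag{31}$$

Now sum both sides over every $t \in T$, using (26):

$$\sum_{w \in W} \sum_{t \in T} n_{dw}\, p(t|d, w) = \lambda. \tag{32}$$

From equation (31) obtain $\theta_{td}$ and substitute $\lambda$ from (32):

$$\theta_{td} = \frac{\sum_{w \in W} n_{dw}\, p(t|d, w)}{\sum_{\omega \in W} \sum_{\tau \in T} n_{d\omega}\, p(\tau|d, \omega)}.$$

The denominator is independent of t, thus

$$\theta_{td} \propto \sum_{w \in W} n_{dw}\, p(t|d, w). \tag{33}$$

Primal feasibility is easily verified by summing $\theta_{td}$ over t.

Dual feasibility follows from (32) and the non-negativity of probabilities.

So this point satisfies the Karush-Kuhn-Tucker conditions.

Also, one can easily see that $\theta_{td} \ge 0$, thus we have found a solution of problem (25), (26), (27).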
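The update derived above, with the word-topic matrix held fixed, can be sketched as follows. This is an illustrative implementation under assumptions: the function name, the uniform starting point and the fixed number of iterations are choices of this sketch. The E-step (28) and the update (33) are fused so that the V x T x D array of posteriors is never stored.

```python
import numpy as np

def fold_in_theta(n_wd, phi_hat, num_iters=50):
    """Estimate theta for every document with phi_hat fixed.

    n_wd    : V x D count matrix with no empty documents
    phi_hat : V x T matrix, phi_hat[w, t] = p(w|t), trained on a subset
    """
    num_topics = phi_hat.shape[1]
    num_docs = n_wd.shape[1]
    theta = np.full((num_topics, num_docs), 1.0 / num_topics)
    for _ in range(num_iters):
        p_wd = phi_hat @ theta                        # denominator of (28)
        ratio = n_wd / np.maximum(p_wd, 1e-300)       # n_dw / sum_tau phi_w,tau theta_tau,d
        # sum_w n_dw p(t|d,w) = theta_td * sum_w phi_wt * n_dw / p_wd, as in (31)
        unnormalized = theta * (phi_hat.T @ ratio)
        theta = unnormalized / unnormalized.sum(axis=0, keepdims=True)   # (33)
    return theta

# Toy check with two topics that use disjoint words.
phi_hat = np.array([[0.5, 0.0],
                    [0.5, 0.0],
                    [0.0, 0.5],
                    [0.0, 0.5]])
n_wd = np.array([[3.0, 1.0],
                 [2.0, 1.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])
theta_hat = fold_in_theta(n_wd, phi_hat)
```

Each document is processed independently of the others, so this step can be run on documents in any order or in parallel.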

4 Experiments

We conduct two kinds of experiments: we evaluate perplexity [9] for our approaches and for classical PLSA, and we compare classification performance on the topic distributions obtained by PLSA and by our approaches.

4.1 Datasets

Both experiments were conducted on three datasets: tweets, news articles and abstracts of scientific papers.

• Twitter dataset

The Twitter dataset contains tweets written in English and posted by 15,000 Twitter users. We merge all tweets posted by a single user into a single document. Every document contains approximately 1,000 tweets. Documents with fewer than 50 words are omitted.

• The 20 Newsgroups data set²

The 20 Newsgroups dataset is often used for testing topic models on text categorization. It is a collection of approximately 20,000 short newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

• arXiv³

The third data set consists of abstracts of scientific articles. It contains approximately 900,000 abstracts from 6 areas: Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. The distribution of articles by area is not uniform: Physics contains 600 thousand abstracts, Mathematics 270 thousand and Quantitative Biology only 5 thousand. Abstracts with fewer than 20 words are omitted. For experiments with fixed $\Phi$ we omit the small areas and take only Physics, Mathematics and Computer Science.

Some text normalization is performed: stop-word removal, lowercasing and removal of rare words (those that occur fewer than 5 times in the whole collection).

4.2 Initial approximation

Four types of initial approximation are compared:

• Random initial approximation for matrix $\Phi$ (words by topic) and matrix $\Theta$ (document by topic). Denoted "randomly".
• Calculate an initial approximation for matrix $\Phi$ on the sample. Use a random initial approximation for matrix $\Theta$. Denoted "phi".
• Calculate an initial approximation for matrix $\Theta$ on the sample. Use a random initial approximation for matrix $\Phi$. Denoted "theta".
• Calculate an initial approximation for matrix $\Phi$ and matrix $\Theta$ on the sample. Denoted "full".

² http://qwone.com/~jason/20Newsgroups/
³ http://arxiv.org/

4.2.1 Perplexity depending on initial approximation

We evaluate the dependence of perplexity on different types of initial approximation. During these experiments the number of iterations is fixed and equals 100. The number of topics is 25 for every experiment.

Figures 1, 2 and 3 show perplexity as a function of the number of iterations for different datasets and different types of initial approximation. The keys are given at the beginning of Section 4.2.

As one can see, all the types of initial approximation decrease the perplexity of the model. A model with an initial approximation found by our approach converges faster than a model with a random initial approximation. The same behavior is observed for every data set. The perplexity values obtained after 100 iterations for different datasets and different types of initialization are presented in Table 1.

In these experiments we evaluate training time depending on the initial approximation. We perform iterations until the stop criterion is satisfied: the change of perplexity is less than 1 five times in a row. Choosing a threshold is not the aim of this work; similar results were observed for a wide range of thresholds. Difference in dispersion is less than standard deviation. Results for different datasets and different types of initialization are presented in Tables 2, 3 and 4. (They include all training time: training on the sample, orthogonalization, finding the initial approximation for $\Theta$, and training on the whole collection.)
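The stop criterion described above can be sketched as a small helper. The perplexity function uses the common definition, the exponent of the negative average log-likelihood per word; that this matches the exact variant used in the experiments is an assumption, and the class and parameter names are hypothetical.

```python
import numpy as np

def perplexity(n_wd, phi, theta):
    """exp(-log-likelihood / total number of words) for a V x D count matrix."""
    p_wd = phi @ theta
    mask = n_wd > 0
    log_lik = np.sum(n_wd[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_wd.sum()))

class PerplexityStop:
    """Signals a stop once the perplexity change stays below a threshold several times in a row."""

    def __init__(self, threshold=1.0, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.previous = None
        self.streak = 0

    def update(self, value):
        if self.previous is not None and abs(self.previous - value) < self.threshold:
            self.streak += 1
        else:
            self.streak = 0          # a large change resets the run of small changes
        self.previous = value
        return self.streak >= self.patience

# Usage on a made-up perplexity trace.
stopper = PerplexityStop()
trace = [100.0, 50.0, 49.5, 49.2, 49.1, 49.05, 49.0]
decisions = [stopper.update(value) for value in trace]

# Sanity check: a uniform model over 4 words has perplexity 4.
uniform_ppl = perplexity(np.ones((4, 3)), np.full((4, 2), 0.25), np.full((2, 3), 0.5))
```

In a training loop, `update` would be called once per EM iteration with the current perplexity, and iteration would end on the first `True`.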

It can be seen that our approach decreases calculation time by a factor of 1.5-2 on every data set. The perplexity of the models whose initial approximation is obtained by our approach is less than or equal to the perplexity of the model with random initialization.

Perplexity inspection is a common way to compare different topic models [9] [10] [11]. We compare our approximation with fixed $\Phi$, trained on subsets of varying size, with PLSA on different datasets. The results are shown in Figures 4, 5 and 6.

As one can see, the perplexity values for PLSA and for our approximation are nearly equal, especially for such large collections as the arXiv or Twitter datasets. It is worth mentioning that all the perplexity curves have a "horizontal tail": matrix $\Phi$ is not inferred better on a larger sample. This means that the size of the sample needed to infer matrix $\Phi$ does not depend on the size of the dataset. This fact is especially significant for training on huge datasets.

4.3.2 Text categorization

Another way to compare different topic models is an application task, for example document categorization. In these experiments we classify news articles by category and Twitter users by gender, treating topic distributions as features. We obtain topic distributions by training PLSA on the whole collection and by our approximation with topics trained on a varying fraction of the whole collection. Then we estimate the quality of classification by 10-fold cross-validation. In both experiments we use 20 topics. For classification we use the random forest classifier from the scikit-learn⁴ package. The results are shown in Figures 7 and 8. The majority-to-minority class ratio for the Twitter dataset is 1.17.
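The evaluation protocol, a random forest on topic distributions scored by 10-fold cross-validation, can be sketched with scikit-learn. The data below is synthetic and only stands in for real topic distributions; the hyperparameters (`n_estimators`, `random_state`) and the default accuracy metric are assumptions of this sketch, since the paper does not state them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
num_docs, num_topics = 200, 20

# Synthetic stand-in for theta: each row is one document's topic distribution.
labels = rng.integers(0, 2, size=num_docs)
lean_first = np.r_[np.full(10, 2.0), np.full(10, 0.5)]   # class 1 favors topics 0..9
lean_last = np.r_[np.full(10, 0.5), np.full(10, 2.0)]    # class 0 favors topics 10..19
alphas = np.where(labels[:, None] == 1, lean_first, lean_last)
features = np.vstack([rng.dirichlet(alpha) for alpha in alphas])

# Random forest on topic distributions, scored by 10-fold cross-validation.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(classifier, features, labels, cv=10)
mean_accuracy = scores.mean()
```

With real data, `features` would be the transpose of the estimated document-topic matrix and `labels` the known categories.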

As one can see, our approximation exhibits comparable results if the training subset is not too small. It works noticeably better for a large collection. Another important observation is that in all the experiments the curve reaches a plateau, which confirms that matrix $\Phi$ may be found by training PLSA on a representative subset.

4.3.3 Time efficiency

One of the aims of our work is to improve time efficiency. We compare training time for PLSA and for our approximation with $|D'| = \frac{1}{4}|D|$. Table 5 presents the calculation time for PLSA and for our approximation (the time to train PLSA on the subset is included).

Another important characteristic is the average time to process one document with our approximation and with PLSA. The obtained values, and the speed-up in comparison to the average time needed to train the original PLSA on a single document, are given in Table 6 (the time to train PLSA on the subset is not included).

⁴ http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

We develop two methods for reducing computation time, one based on finding an appropriate initial approximation and the other based on fixing matrix $\Phi$, and test these methods on three different datasets. The method based on finding an initial approximation demonstrates the same behavior on every dataset used: the calculation time and the number of iterations to convergence decrease, yet the quality of the topic model does not decrease. We confirm that the transition from Kullback-Leibler divergence to the L2 norm is appropriate for finding an initial approximation for PLSA.

The method based on fixing matrix $\Phi$ demonstrates a more significant speed-up, but at the cost of some precision. However, the drop in precision is not significant, especially on large datasets.

References

[1] Timothy Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Mach. Learn., 88(1-2):157–208, July 2012.

[2] Cailing Dong and Bin Zhou. Effectively detecting content spam on the web using topical diversity measures. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:266–273, 2012.

[3] Venkatesh Ramanathan and Harry Wechsler. phishgillnet–phishing detection methodology using probabilistic latent semantic analysis, adaboost, and co-training. EURASIP Journal on Information Security, 2012(1):1, 2012.

[4] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[5] Chuntao Hong, Yurong Chen, Weimin Zheng, Jiulong Shan, Yurong Chen, and Yimin Zhang. Parallelization and characterization of probabilistic latent semantic analysis. In Parallel Processing, 2008. ICPP '08. 37th International Conference on, pages 628–635, 2008.

[6] Raymond Wan, Vo Ngoc Anh, and Hiroshi Mamitsuka. Efficient probabilistic latent semantic analysis through parallelization. In Gary Geunbae Lee, Dawei Song, Chin-Yew Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and Tetsuya Sakai, editors, Information Retrieval Technology, volume 5839 of Lecture Notes in Computer Science, pages 432–443. Springer Berlin Heidelberg, 2009.

[7] Ayman Farahat. F.r.: Improving probabilistic latent semantic analysis using principal component analysis. In Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2006.

[8] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[9] Anna Potapenko and Konstantin Vorontsov. Robust plsa performs better than lda. In ECIR, pages 784–787, 2013.

[10] David Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[11] Hanna Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1105–1112, New York, NY, USA, 2009. ACM.