<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ARTM vs. LDA: an SVD Extension Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergey Nikolenko</string-name>
          <email>sergey@logic.pdmi.ras.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deloitte Analytics Institute</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kazan (Volga Region) Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Laboratory for Internet Studies, NRU Higher School of Economics</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Steklov Institute of Mathematics at St. Petersburg</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we compare two extensions of two different topic models for the same problem of recommending full-text items: previously developed SVD-LDA and its counterpart SVD-ARTM based on additive regularization. We show that ARTM naturally leads to the inference algorithm that has to be painstakingly developed for LDA. Topic models are an important part of the natural language processing landscape, providing unsupervised ways to quickly evaluate what a whole corpus of texts is about and classify the texts into well-defined topics. LDA extensions provide ways to augment basic topic models with additional information and retool them to serve other purposes. In a previous work, we combined the SVD and LDA decompositions into a single unified model that optimizes the joint likelihood function and thus infers topics that are especially useful for improving recommendations. We provided an inference algorithm based on Gibbs sampling, developing an approximate sampling scheme based on a first order approximation to Gibbs sampling [1]. The recently developed ARTM approach [2-5] extends the basic pLSA model with regularizers and provides a unified way to add new additive regularizers; inference algorithms follow from simple differentiation of the regularizers. In this work, we apply ARTM to the problem of adding SVD decompositions to a topic model; we show that one can automatically arrive at an inference algorithm very similar to our previous SVD-LDA approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>LDA and SVD-LDA</title>
      <p>
        The graphical model of LDA [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] is shown on Figure 1a. We assume that a corpus of $D$ documents contains $T$ topics expressed by $W$ different words. Each document $d \in D$ is modeled as a discrete distribution $\theta^{(d)}$ on the set of topics: $p(z_w = j) = \theta^{(d)}_j$, where $z$ is a discrete variable that defines the topic of each word $w \in d$. Each topic, in turn, corresponds to a multinomial distribution on words: $p(w \mid z_w = k) = \phi_{w,k}$. The model also introduces prior Dirichlet distributions with parameters $\alpha$ for the topic vectors $\theta$, $\theta \sim \mathrm{Dir}(\alpha)$, and $\beta$ for the word distributions $\phi$, $\phi \sim \mathrm{Dir}(\beta)$.
      </p>
      <p>[Figure 1: (a) the graphical model of LDA; (b) the graphical model of sLDA.]</p>
      <p>
        A document is generated word by word: for each word, we (1) sample the topic index $k$ from the distribution $\theta^{(d)}$; (2) sample the word $w$ from the distribution $\phi_{\cdot,k}$. Inference in LDA is usually done via either variational approximations or Gibbs sampling; we use the latter since it is easy to generalize to further extensions. In the basic LDA model, Gibbs sampling reduces to the so-called collapsed Gibbs sampling, where the $\theta$ and $\phi$ variables are integrated out, and the $z_w$ are iteratively resampled according to the following distribution:
        $$p(z_w = t \mid \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \propto \frac{n^{(d_w)}_{-w,t} + \alpha}{\sum_{t' \in T} n^{(d_w)}_{-w,t'} + T\alpha} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W} n^{(w')}_{-w,t} + W\beta},$$
        where $n^{(d_w)}_{-w,t}$ is the number of words in document $d$ chosen with topic $t$ and $n^{(w)}_{-w,t}$ is the number of times word $w$ has been generated from topic $t$, apart from the current value of $z_w$; both counters depend on the other variables $\mathbf{z}_{-w}$. Samples are then used to estimate model variables:
        $$\theta_{d,t} = \frac{n^{(d)}_{t} + \alpha}{\sum_{t' \in T} n^{(d)}_{t'} + T\alpha}, \qquad \phi_{w,t} = \frac{n^{(w)}_{t} + \beta}{\sum_{w' \in W} n^{(w')}_{t} + W\beta},$$
        where $\phi_{w,t}$ denotes the probability to draw word $w$ in topic $t$ and $\theta_{d,t}$ is the probability to draw topic $t$ for a word in document $d$.
      </p>
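      <p>To make the collapsed Gibbs sampler above concrete, the following is a minimal sketch in Python/NumPy; the corpus representation (a list of word-index lists) and all identifiers are illustrative assumptions rather than part of the model itself.</p>
      <preformat>
import numpy as np

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word indices in 0..W-1.
    Returns point estimates of theta (D x T) and phi (W x T).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dt = np.zeros((D, T))   # words of document d assigned to topic t
    n_wt = np.zeros((W, T))   # occurrences of word w assigned to topic t
    n_t = np.zeros(T)         # total words assigned to topic t
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the current assignment from the counters
                n_dt[d, t] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1
                # document factor (its denominator does not depend on t) times word factor
                p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
    phi = (n_wt + beta) / (n_wt.sum(axis=0, keepdims=True) + W * beta)
    return theta, phi
      </preformat>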
      <p>
        The basic LDA model has been used and extended in numerous
applications; the relevant class of extensions for us now takes into account additional
information that may be available together with the documents and may
reveal additional insights into the topical structure. For instance, the Topics over
Time model and dynamic topic models apply when documents have timestamps
of their creation (e.g., news articles or blog posts) [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8–10</xref>
        ], DiscLDA assumes that
each document is assigned a categorical label and attempts to utilize LDA
for mining topic classes related to classification [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the Author-Topic model
incorporates information about the authors of a document [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], and so on.
      </p>
      <p>
        The SVD-LDA model, presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], can be regarded as an extension of the Supervised LDA (sLDA) model [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The sLDA graphical model is shown on Fig. 1b. In sLDA, each document is augmented with a response variable $y$ drawn from a normal distribution centered around a linear combination of the document's topical distribution ($\bar{z}$, the average of the $z$ variables in this document) with some unknown parameters $b$, $a$ that are also to be trained: $y \sim \mathcal{N}(y \mid b^\top \bar{z} + a, \sigma^2)$.
      </p>
      <p>
        The original work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] presents an inference algorithm for sLDA based on variational approximations, but in this work we operate with Gibbs sampling, which will be easier to extend to SVD-LDA later. Thus, we show an sLDA Gibbs sampling scheme. It differs from the original LDA in that the model likelihood gets another factor corresponding to the $y$ variable: $p(y_d \mid \mathbf{z}, b, \sigma^2) \propto \exp\left(-(y_d - b^\top \bar{z}_d - a)^2 / 2\sigma^2\right)$, and the total likelihood is now
        $$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d e^{-(y_d - b^\top \bar{z}_d - a)^2 / 2\sigma^2}.$$
        On each iteration of the sampling algorithm, we now first sample $\mathbf{z}$ for fixed $b$ and then train $b$ for fixed (sampled) $\mathbf{z}$. The sampling distributions for each $z$ variable, according to the equation above, are
        $$p(z_w = t \mid \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta)\, e^{-\frac{1}{2\sigma^2}(y_d - b^\top \bar{z} - a)^2} = \frac{n^{(d_w)}_{-w,t} + \alpha}{\sum_{t' \in T} n^{(d_w)}_{-w,t'} + T\alpha} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W} n^{(w')}_{-w,t} + W\beta}\, e^{-\frac{1}{2\sigma^2}(y_d - b^\top \bar{z} - a)^2}.$$
        The latter equation can be either used directly or further transformed by separating $\mathbf{z}_{-w}$ explicitly.
      </p>
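      <p>A minimal sketch of the resulting per-word sampling step, with the LDA factor reweighted by the Gaussian response term; the variable names are assumptions for illustration only.</p>
      <preformat>
import numpy as np

def slda_topic_probs(q, zbar, old_t, N_d, y_d, b, a, sigma2):
    """Sampling distribution for the topic of one word under sLDA:
    the LDA factor q (length T) times the Gaussian response factor.

    zbar: the document's averaged topic-indicator vector (length T); moving the word
    from its current topic old_t to a candidate topic t shifts zbar by -1/N_d and +1/N_d.
    """
    T = len(q)
    probs = np.empty(T)
    for t in range(T):
        zbar_t = zbar.copy()
        zbar_t[old_t] -= 1.0 / N_d
        zbar_t[t] += 1.0 / N_d
        resid = y_d - b @ zbar_t - a            # residual of the Gaussian response
        probs[t] = q[t] * np.exp(-resid ** 2 / (2.0 * sigma2))
    return probs / probs.sum()
      </preformat>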
      <p>
        SVD-LDA considers a recommender system based on likes and dislikes, so it uses the logistic sigmoid $\sigma(x) = 1 / (1 + \exp(-x))$ of a linear function to model the probability of a “like”: $p(\mathrm{success}_{i,a}) = \sigma(b^\top \bar{z} + a)$. In this version of sLDA, the graphical model remains the same, only conditional probabilities change. The total likelihood is now
        $$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \alpha, \beta, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d \prod_{x \in X_d} \sigma(b^\top \bar{z}_d + a)^{y_x} \left(1 - \sigma(b^\top \bar{z}_d + a)\right)^{1 - y_x},$$
        where $X_d$ is the set of experiments (ratings) for document $d$, and $y_x$ is the binary result of one such experiment. The sampling procedure also remains the same, except that now we train logistic regression with respect to $b$, $a$ for fixed $\mathbf{z}$ instead of linear regression, and the sampling probabilities for each $z$ variable are now
        $$p(z_w = t \mid \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \prod_{x \in X_d} \left[\sigma(b^\top \bar{z}_d + a)\right]^{y_x} \left[1 - \sigma(b^\top \bar{z}_d + a)\right]^{1 - y_x} = q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta)\, e^{s_d \log p_d + (|X_d| - s_d) \log(1 - p_d)},$$
        where $s_d$ is the number of successful experiments among $X_d$, and $p_d = \frac{1}{1 + e^{-b^\top \bar{z}_d - a}}$.
      </p>
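      <p>In the like/dislike setting the Gaussian factor is thus replaced by the Bernoulli likelihood written above in log form; a small illustrative sketch (all names assumed):</p>
      <preformat>
import numpy as np

def bernoulli_response_weight(zbar_t, s_d, n_ratings, b, a):
    """Likelihood factor for a candidate topic assignment in the like/dislike version:
    exp(s_d * log p_d + (|X_d| - s_d) * log(1 - p_d)) with p_d = sigma(b^T zbar + a).

    zbar_t: averaged topic vector with the current word moved to the candidate topic,
    s_d: number of likes among the document's ratings, n_ratings: |X_d|.
    """
    p_d = 1.0 / (1.0 + np.exp(-(b @ zbar_t + a)))
    p_d = np.clip(p_d, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return np.exp(s_d * np.log(p_d) + (n_ratings - s_d) * np.log(1.0 - p_d))
      </preformat>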
      <p>
        The SVD-LDA extension has been introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as follows: for recommendations we use an SVD model with additional predictors corresponding to how much a certain user or group of users likes the topics trained in the LDA model; since our dataset is binary (like-dislike), we use a logistic version of the SVD model: $p(\mathrm{success}_{i,a}) = \sigma(\hat{r}_{i,a}) = \sigma\left(\mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i\right)$, where $p_i$ may be absent in case of cold start, and $l_i$ may be shared among groups (clusters) of users. The total likelihood of the dataset with ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihood of each rating (assuming, as usual, that they are independent): $p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \prod_D \sigma(\hat{r}_{i,a})^{[r=1]} \left(1 - \sigma(\hat{r}_{i,a})\right)^{[r=-1]}$, and the logarithm is
        $$\log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r = 1] \log \sigma(\hat{r}_{i,a}) + [r = -1] \log\left(1 - \sigma(\hat{r}_{i,a})\right)\right),$$
        where $[r = 1] = 1$ if $r = 1$ and $[r = 1] = 0$ otherwise, and $\theta_a$ is the vector of topics trained for document $a$ in the LDA model, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$. Sampling probabilities for each $z$ variable now look like
        $$p(z_w = t \mid \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_{a,w \to t}) = q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \prod_D \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_{a,w \to t}\right)^{[r=1]} \left(1 - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_{a,w \to t}\right)\right)^{[r=-1]},$$
        where $\hat{r}^{\mathrm{SVD}}_{i,a} = \mu + b_i + b_a + q_a^\top p_i$, and $\theta_{a,w \to t}$ is the vector of topics for document $a$ where topic $t$ is substituted in place of $z_w$. We see that in the formula above, to compute the sampling distribution for a single $z_w$ variable one has to take a sum over all ratings all users have provided for this document, and due to the presence of the sigmoid function one cannot cancel out terms and reduce the sum to updating counts. It is possible to store precomputed values of $\hat{r}^{\mathrm{SVD}}_{i,a}$ in memory, but it does not help because the $z_w$ variables change during sampling, and when they do all values of $\sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_{a,w \to t}\right)$ also have to be recomputed for each rating from the database.
      </p>
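      <p>For concreteness, a sketch of the logistic SVD predictor with the topic term, with all identifiers hypothetical:</p>
      <preformat>
import numpy as np

def svd_lda_like_probability(mu, b_i, b_a, p_i, q_a, l_i, theta_a):
    """Probability of a 'like': sigma(mu + b_i + b_a + q_a^T p_i + theta_a^T l_i)."""
    r_hat = mu + b_i + b_a + q_a @ p_i + theta_a @ l_i
    return 1.0 / (1.0 + np.exp(-r_hat))
      </preformat>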
      <p>
        To make the model feasible, a simplified SVD-LDA training algorithm was developed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that could run reasonably fast on large datasets. It used a first order approximation to the log likelihood based on its Taylor series at zero:
        $$\frac{\partial}{\partial \theta_a} \log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r = 1] - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i. \qquad (1)$$
      </p>
      <p>
        We denote $s_a = \sum_D \left([r = 1] - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$. We can now precompute $s_a$ (a vector over topics) for each document right after SVD training (with additional memory of the same size as the $\Theta$ matrix) and use it in LDA sampling:
        $$p(z_w = t \mid \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, \mathbf{w}, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_{a,w \to t}),$$
        and the latter is proportional to simply
        $$\frac{n^{(d_w)}_{-w,t} + \alpha}{\sum_{t' \in T} n^{(d_w)}_{-w,t'} + T\alpha} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W} n^{(w')}_{-w,t} + W\beta}\, e^{s_t \theta_t},$$
        because $s_a^\top \theta_{a,w \to t} = s_a^\top \theta_a - s_{z_w} \theta_{z_w} + s_t \theta_t$, and the first two terms do not depend on $t$ which is being sampled. Thus, the first order approximation yields a simple modification of LDA sampling that incurs relatively small computational overhead as compared to the sampling itself.
      </p>
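      <p>Under the first order approximation, the per-word sampling step therefore reduces to reweighting the LDA factor by $e^{s_t \theta_t}$ with the precomputed vector $s_a$; a minimal illustrative sketch (identifiers assumed):</p>
      <preformat>
import numpy as np

def approx_svd_lda_probs(q, s_a, theta_a):
    """First-order-approximate SVD-LDA sampling distribution for one word:
    the plain LDA factor q (length T) reweighted by exp(s_t * theta_t),
    where s_a is the gradient vector precomputed after SVD training."""
    p = q * np.exp(s_a * theta_a)
    return p / p.sum()
      </preformat>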
      <p>
        We have outlined a general approximate sampling scheme; several different
variations are possible depending on which predictors are shared in the basic
SVD model, p(successi;a) = (r^i;a) : In general, a separate set of li features for
every user would lead to heavy overfitting, so we used two variations: either share
li = l among all users or share li = lc among certain clusters of users, preferably
inferred from some external information, e.g., demographic features. Both
variations can be used for cold start with respect to users. Table 1 summarizes the
results of experiments that show that SVD-LDA does indeed improve upon the
basic LDA model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>SVD-ARTM</title>
      <p>
        In recent works [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
        ], K. Vorontsov and coauthors demonstrated that if one
adds regularizers to the objective function at the training stage of the basic
probabilistic Latent Semantic Analysis (pLSA) model, which actually predates
LDA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], one can impose a very wide variety of constraints on the resulting topic
model. This approach has been called Additive Regularization of Topic Models
(ARTM). In particular, the authors showed that one can formulate a regularizer
that imposes constraints on the smoothness of topic-document and word-topic
distributions that will correspond to the Bayesian approach expressed in LDA
(i.e., it will smooth out the distributions).
      </p>
      <p>
        Formally speaking, for a set of regularizers $R_i(\Phi, \Theta)$, $i = 1..r$, and regularization weights $\tau_i$, $i = 1..r$, we can extend the objective function to maximize
        $$L(\Phi, \Theta) + R(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in W} n_{dw} \log p(w \mid d) + \sum_{i=1}^{r} \tau_i R_i(\Phi, \Theta).$$
        By the Karush–Kuhn–Tucker conditions, any solution of the resulting problem satisfies the following system of equations:
        $$p_{tdw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{td}\right), \qquad \phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\right), \qquad \theta_{td} = \mathrm{norm}^{+}_{t \in T}\left(n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\right),$$
        $$n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}, \qquad n_{td} = \sum_{w \in d} n_{dw} p_{tdw},$$
        where $\mathrm{norm}^{+}$ denotes non-negative normalization: $\mathrm{norm}^{+}_{a \in A} x_a = \frac{\max\{x_a, 0\}}{\sum_{b \in A} \max\{x_b, 0\}}$. This system of equations yields a natural iterative algorithm (Newton's method) for finding the parameters $\phi_{wt}$ and $\theta_{td}$, equivalent to EM inference in pLSA; see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for a full derivation and a more detailed treatment. Thus, we have a model which is very easy to extend and which is computationally cheaper to train than the LDA model, especially LDA extensions that rely on Gibbs sampling.
      </p>
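      <p>A minimal sketch of the resulting regularized EM-style iteration, with the regularizer supplied as gradient callbacks; the identifiers are illustrative assumptions and this is not the BigARTM API.</p>
      <preformat>
import numpy as np

def artm_iteration(n_dw, phi, theta, dR_dphi, dR_dtheta):
    """One iteration of the ARTM system of equations for a regularizer R.

    n_dw: (D, W) document-word counts; phi: (W, T); theta: (D, T);
    dR_dphi(phi, theta) and dR_dtheta(phi, theta) return gradients of R
    with the same shapes as phi and theta, respectively.
    """
    n_wt = np.zeros_like(phi)
    n_dt = np.zeros_like(theta)
    for d in range(n_dw.shape[0]):
        # E-step: p_tdw = norm_t(phi_wt * theta_td), one document at a time
        p = phi * theta[d]                                   # shape (W, T)
        p /= np.maximum(p.sum(axis=1, keepdims=True), 1e-12)
        weighted = n_dw[d][:, None] * p
        n_wt += weighted
        n_dt[d] = weighted.sum(axis=0)
    # M-step with non-negative normalization norm^+
    phi_new = np.maximum(n_wt + phi * dR_dphi(phi, theta), 0.0)
    phi_new /= np.maximum(phi_new.sum(axis=0, keepdims=True), 1e-12)
    theta_new = np.maximum(n_dt + theta * dR_dtheta(phi, theta), 0.0)
    theta_new /= np.maximum(theta_new.sum(axis=1, keepdims=True), 1e-12)
    return phi_new, theta_new
      </preformat>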
      <p>
        To extend ARTM with an SVD-based regularizer, we begin with a regularizer in the same form as in the previous section: the total likelihood of the dataset with ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihood of each rating, so its logarithm is
        $$R(\Phi, \Theta) = \log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r = 1] \log \sigma(\hat{r}_{i,a}) + [r = -1] \log\left(1 - \sigma(\hat{r}_{i,a})\right)\right),$$
        where $[r = 1] = 1$ if $r = 1$ and $[r = 1] = 0$ otherwise, $\theta_a$ is the vector of topics trained for document $a$ in the topic model, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$, and
        $$\hat{r}_{i,a} = \hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i = \mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i.$$
        To add this regularizer to the pLSA model, we have to compute its partial derivatives with respect to the parameters:
        $$\frac{\partial R}{\partial \theta_{ta}} = \sum_{(i,a,r) \in D} \left([r = 1] - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t};$$
        note that the latter equality is exactly the same as (1) (hence we omit the derivation), only now it is a direct part of the algorithm rather than a first order approximation to the sampling. The final algorithm is, thus, to iterate the following:
        $$p_{taw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{ta}\right), \qquad \phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt}\right), \qquad n_{wt} = \sum_{a \in D} n_{aw} p_{taw}, \qquad n_{ta} = \sum_{w \in a} n_{aw} p_{taw},$$
        $$\theta_{ta} = \mathrm{norm}^{+}_{t \in T}\left(n_{ta} + \theta_{ta} \sum_{(i,a,r) \in D} \left([r = 1] - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t}\right).$$
        Similar to SVD-LDA, we can precompute $s_a = \sum_D \left([r = 1] - \sigma\left(\hat{r}^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$ (it is a vector over topics) for each document after SVD is trained and use it throughout a pLSA iteration.
      </p>
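      <p>Continuing the illustrative sketch above, the SVD regularizer then amounts to one gradient callback for $\Theta$ (and a zero gradient for $\Phi$), built from the precomputed vectors $s_a$; all identifiers below are assumptions for illustration.</p>
      <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def precompute_s(ratings, theta, l_users, mu, b_user, b_item, p_user, q_item):
    """s_a = sum over the ratings of document a of ([r = 1] - sigma(r_hat_{i,a})) * l_i."""
    D, T = theta.shape
    s = np.zeros((D, T))
    for i, a, r in ratings:
        r_hat = mu + b_user[i] + b_item[a] + q_item[a] @ p_user[i] + theta[a] @ l_users[i]
        s[a] += ((1.0 if r == 1 else 0.0) - sigmoid(r_hat)) * l_users[i]
    return s

def svd_regularizer_grads(ratings, l_users, mu, b_user, b_item, p_user, q_item):
    """Gradient callbacks for the generic artm_iteration sketch above:
    the SVD regularizer does not depend on phi, and dR/dtheta_{ta} = s_{a,t}."""
    dR_dphi = lambda phi, theta: np.zeros_like(phi)
    dR_dtheta = lambda phi, theta: precompute_s(
        ratings, theta, l_users, mu, b_user, b_item, p_user, q_item)
    return dR_dphi, dR_dtheta
      </preformat>
      <p>In an actual implementation, the vectors $s_a$ would be recomputed once per pass, right after the SVD parameters are retrained, and reused throughout the pLSA iteration as described above.</p>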
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>
        In this work, we have developed an ARTM regularizer that adds an SVD-based matrix decomposition model on top of the basic topic model. We have shown that the resulting
inference algorithms closely match the inference algorithms developed in the
SVD-LDA modification of LDA with a first-order approximation to the Gibbs
sampling. In further work, we plan to implement this regularizer and incorporate
it into the BigARTM library [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>Acknowledgements. This work was supported by the Russian Science Foundation
grant no. 15-11-10019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nikolenko</surname>
            ,
            <given-names>S.I.</given-names>
          </string-name>
          :
          <article-title>SVD-LDA: Topic modeling for full-text recommender systems</article-title>
          .
          <source>In: Proc. 14th Mexican International Conference on Artificial Intelligence. LNAI</source>
          vol.
          <volume>9414</volume>
          , Springer (
          <year>2015</year>
          )
          <fpage>67</fpage>
          -
          <lpage>79</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Additive regularization for topic models of text collections</article-title>
          .
          <source>Doklady Mathematics</source>
          <volume>89</volume>
          (
          <year>2014</year>
          )
          <fpage>301</fpage>
          -
          <lpage>304</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Potapenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Robust pLSA performs better than LDA</article-title>
          .
          <source>In: Proc. 35th European Conf. on IR Research</source>
          . LNCS vol.
          <volume>7814</volume>
          , Springer (
          <year>2013</year>
          )
          <fpage>784</fpage>
          -
          <lpage>787</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frei</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Apishev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suvorova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yanina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Nonbayesian additive regularization for multimodal topic modeling of large collections</article-title>
          .
          <source>In: Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications. TM '15</source>
          , New York, NY, USA, ACM (
          <year>2015</year>
          )
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sokolov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogolubsky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Topic models regularization and initialization for regression problems</article-title>
          .
          <source>In: Proc. of the 2015 Workshop on Topic Models: PostProcessing and Applications</source>
          . TM '
          <volume>15</volume>
          , New York, NY, USA, ACM (
          <year>2015</year>
          )
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding scientific topics</article-title>
          .
          <source>Proceedings of the National Academy of Sciences 101 (Suppl. 1)</source>
          (
          <year>2004</year>
          )
          <fpage>5228</fpage>
          -
          <lpage>5235</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Topics over time: a non-Markov continuous-time model of topical trends</article-title>
          .
          <source>In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , New York, NY, USA, ACM (
          <year>2006</year>
          )
          <fpage>424</fpage>
          -
          <lpage>433</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Dynamic topic models</article-title>
          .
          <source>In: Proc. of the 23rd International Conference on Machine Learning</source>
          , New York, NY, USA, ACM (
          <year>2006</year>
          )
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heckerman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Continuous time dynamic topic models</article-title>
          .
          <source>In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence</source>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lacoste-Julien</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sha</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.:</given-names>
          </string-name>
          <article-title>DiscLDA: Discriminative learning for dimensionality reduction and classification</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          <volume>21</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rosen-Zvi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The author-topic model for authors and documents</article-title>
          .
          <source>In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence</source>
          , Arlington, Virginia, United States, AUAI Press (
          <year>2004</year>
          )
          <fpage>487</fpage>
          -
          <lpage>494</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rosen-Zvi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chemudugunta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning author-topic models from text corpora</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>28</volume>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAuliffe</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Supervised topic models</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          <volume>20</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Unsupervised learning by probabilistic latent semantic analysis</article-title>
          .
          <source>Machine Learning</source>
          <volume>42</volume>
          (
          <year>2001</year>
          )
          <fpage>177</fpage>
          -
          <lpage>196</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>