ARTM vs. LDA: an SVD Extension Case Study

Sergey Nikolenko 1,2,3,4

1 Laboratory for Internet Studies, NRU Higher School of Economics
2 Steklov Institute of Mathematics at St. Petersburg, Russia
3 Kazan (Volga Region) Federal University, Kazan, Russia
4 Deloitte Analytics Institute, Moscow, Russia
sergey@logic.pdmi.ras.ru

Abstract. In this work, we compare two extensions of two different topic models applied to the same problem of recommending full-text items: the previously developed SVD-LDA and its counterpart SVD-ARTM based on additive regularization. We show that ARTM naturally leads to the inference algorithm that has to be painstakingly developed for LDA.

1 Introduction

Topic models are an important part of the natural language processing landscape, providing unsupervised ways to quickly evaluate what a whole corpus of texts is about and to classify the texts into well-defined topics. LDA extensions provide ways to augment basic topic models with additional information and retool them to serve other purposes. In previous work, we combined the SVD and LDA decompositions into a single unified model that optimizes the joint likelihood function and thus infers topics that are especially useful for improving recommendations. We provided an inference algorithm based on Gibbs sampling, developing an approximate sampling scheme based on a first-order approximation to Gibbs sampling [1]. The recently developed ARTM approach [2–5] extends the basic pLSA model with regularizers and provides a unified way to add new additive regularizers; inference algorithms follow by simple differentiation of the regularizers. In this work, we apply ARTM to the problem of adding an SVD decomposition to a topic model; we show that one automatically arrives at an inference algorithm very similar to our previous SVD-LDA approach.

2 LDA and SVD-LDA

The graphical model of LDA [6, 7] is shown in Fig. 1a. We assume that a corpus of $D$ documents contains $T$ topics expressed by $W$ different words. Each document $d \in D$ is modeled as a discrete distribution $\theta^{(d)}$ on the set of topics: $p(z_w = t) = \theta^{(d)}_t$, where $z_w$ is a discrete variable that defines the topic of each word $w \in d$. Each topic, in turn, corresponds to a multinomial distribution on words: $p(w \mid z_w = t) = \phi^{(t)}_w$. The model also introduces prior Dirichlet distributions with parameters $\alpha$ for the topic vectors $\theta$, $\theta \sim \mathrm{Dir}(\alpha)$, and $\beta$ for the word distributions $\phi$, $\phi \sim \mathrm{Dir}(\beta)$.

Fig. 1: The (a) LDA and (b) sLDA graphical models.

A document is generated word by word: for each word, we (1) sample a topic index $t$ from the distribution $\theta^{(d)}$ and (2) sample the word $w$ from the distribution $\phi^{(t)}$. Inference in LDA is usually done via either variational approximations or Gibbs sampling; we use the latter since it is easier to generalize to further extensions. In the basic LDA model, Gibbs sampling reduces to so-called collapsed Gibbs sampling, where the $\theta$ and $\phi$ variables are integrated out and the $z_w$ are iteratively resampled according to the distribution
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)},$$
where $n^{(d)}_{-w,t}$ is the number of words in document $d$ assigned to topic $t$ and $n^{(w)}_{-w,t}$ is the number of times word $w$ has been generated from topic $t$, in both cases excluding the current value of $z_w$; both counters depend on the other variables $\mathbf{z}_{-w}$. Samples are then used to estimate the model variables:
$$\theta_{d,t} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)}, \qquad \phi_{w,t} = \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)},$$
where $\phi_{w,t}$ denotes the probability to draw word $w$ in topic $t$ and $\theta_{d,t}$ is the probability to draw topic $t$ for a word in document $d$.
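To make the collapsed Gibbs update concrete, here is a minimal sketch of one resampling step (this is our own illustration, not code from [1]; the counter names and function signature are assumptions). It keeps document-topic, word-topic, and topic-total counters and resamples a single token; the document-side denominator is dropped because it does not depend on the topic being sampled.

```python
import numpy as np

def resample_topic(d, w, t_old, n_dt, n_wt, n_t, alpha, beta, W):
    """One collapsed Gibbs step for the token of word w in document d.

    n_dt[d, t] -- number of words in document d assigned to topic t,
    n_wt[w, t] -- number of times word w was generated from topic t,
    n_t[t]     -- total number of words assigned to topic t.
    """
    # remove the current assignment to obtain the "-w" counters
    n_dt[d, t_old] -= 1
    n_wt[w, t_old] -= 1
    n_t[t_old] -= 1

    # unnormalized sampling weights over topics
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)
    t_new = np.random.choice(len(p), p=p / p.sum())

    # record the new assignment
    n_dt[d, t_new] += 1
    n_wt[w, t_new] += 1
    n_t[t_new] += 1
    return t_new
```

Sweeping this step over all tokens for a number of iterations and averaging the counters yields the estimates of $\theta$ and $\phi$ given above.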
The basic LDA model has been used and extended in numerous applications; the class of extensions most relevant here takes into account additional information that may be available together with the documents and may reveal additional insights into their topical structure. For instance, the Topics over Time model and dynamic topic models apply when documents have creation timestamps (e.g., news articles or blog posts) [8–10], DiscLDA assumes that each document is assigned a categorical label and attempts to use LDA for mining topic classes related to classification [11], the Author-Topic model incorporates information about the authors of a document [12, 13], and so on.

The SVD-LDA model, presented in [1], can be regarded as an extension of the supervised LDA (sLDA) model [14]. The sLDA graphical model is shown in Fig. 1b. In sLDA, each document is augmented with a response variable $y$ drawn from a normal distribution centered around a linear combination of the document's topical distribution ($\bar z$, the average of the $z$ variables in this document) with unknown parameters $b$, $a$ that are also to be trained: $y \sim \mathcal{N}(y \mid b^\top \bar z + a, \sigma^2)$. The original work [14] presents an inference algorithm for sLDA based on variational approximations, but here we work with Gibbs sampling, which will be easier to extend to SVD-LDA later. Thus, we show an sLDA Gibbs sampling scheme. It differs from the original LDA in that the model likelihood gets another factor corresponding to the $y$ variable, $p(y_d \mid \mathbf{z}, b, \sigma^2) \propto \exp\left(-(y_d - b^\top \bar z_d - a)^2/2\right)$, and the total likelihood is now
$$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d e^{-(y_d - b^\top \bar z_d - a)^2/2}.$$
On each iteration of the sampling algorithm, we now first sample $\mathbf{z}$ for fixed $b$ and then train $b$ for fixed (sampled) $\mathbf{z}$. The sampling distributions for each $z$ variable, according to the equation above, are
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, e^{-\frac{1}{2}(y_d - b^\top \bar z - a)^2} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{-\frac{1}{2}(y_d - b^\top \bar z - a)^2}.$$
The latter equation can be either used directly or further transformed by separating $\mathbf{z}_{-w}$ explicitly.

SVD-LDA considers a recommender system based on likes and dislikes, so it uses the logistic sigmoid $\sigma(x) = 1/(1 + e^{-x})$ of a linear function to model the probability of a "like": $p(\mathrm{success}_{i,a}) = \sigma\left(b^\top \bar z + a\right)$. In this version of sLDA, the graphical model remains the same; only the conditional probabilities change. The total likelihood is now
$$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \alpha, \beta, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d \prod_{x \in X_d} \sigma\left(b^\top \bar z_d + a\right)^{y_x} \left(1 - \sigma\left(b^\top \bar z_d + a\right)\right)^{1 - y_x},$$
where $X_d$ is the set of experiments (ratings) for document $d$, and $y_x$ is the binary result of one such experiment. The sampling procedure also remains the same, except that we now train logistic regression with respect to $b$, $a$ for fixed $\mathbf{z}$ instead of linear regression, and the sampling probabilities for each $z$ variable are now
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta) \prod_{x \in X_d} \left[\sigma\left(b^\top \bar z_d + a\right)\right]^{y_x} \left[1 - \sigma\left(b^\top \bar z_d + a\right)\right]^{1 - y_x} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_d \log p_d + (|X_d| - s_d) \log(1 - p_d)},$$
where $s_d$ is the number of successful experiments among $X_d$ and $p_d = \frac{1}{1 + e^{-b^\top \bar z_d - a}}$.
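The following minimal sketch (our own names and interface, not code from [1]) computes the per-topic sampling weights for one token in this like/dislike version of sLDA: the LDA count ratio is multiplied by the likelihood of the document's ratings under each candidate topic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_slda_weights(n_dt_minus, n_wt_minus, n_t_minus, alpha, beta, W,
                          z_bar_minus, N_d, b, a, s_d, num_ratings):
    """Per-topic sampling weights for one token of document d.

    n_dt_minus, n_wt_minus, n_t_minus -- counters with the current token removed,
    z_bar_minus -- sum of topic indicators of the other tokens of d, divided by N_d,
    s_d         -- number of "likes" among the document's ratings, num_ratings = |X_d|.
    """
    T = len(n_t_minus)
    lda_part = (n_dt_minus + alpha) * (n_wt_minus + beta) / (n_t_minus + W * beta)
    weights = np.empty(T)
    for t in range(T):
        z_bar = z_bar_minus.copy()
        z_bar[t] += 1.0 / N_d                  # reassign the current token to topic t
        p_d = sigmoid(b @ z_bar + a)           # probability of a "like"
        weights[t] = lda_part[t] * p_d ** s_d * (1.0 - p_d) ** (num_ratings - s_d)
    return weights / weights.sum()
```

The inner loop over topics makes explicit why this step is expensive: the sigmoid has to be re-evaluated for every candidate topic, which is the cost that the first-order approximation introduced below is designed to avoid.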
The SVD-LDA extension was introduced in [1] as follows: for recommendations we use an SVD model with additional predictors corresponding to how much a certain user or group of users likes the topics trained in the LDA model; since the dataset is binary (like-dislike), we use a logistic version of the SVD model,
$$p(\mathrm{success}_{i,a}) = \sigma\left(\hat r_{i,a}\right) = \sigma\left(\mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i\right),$$
where $p_i$ may be absent in case of cold start, and $l_i$ may be shared among groups (clusters) of users. The total likelihood of the dataset of ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihoods of individual ratings (assuming, as usual, that they are independent):
$$p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \prod_D \sigma\left(\hat r_{i,a}\right)^{[r=1]} \left(1 - \sigma\left(\hat r_{i,a}\right)\right)^{[r=-1]},$$
and its logarithm is
$$\log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r=1] \log \sigma\left(\hat r_{i,a}\right) + [r=-1] \log\left(1 - \sigma\left(\hat r_{i,a}\right)\right)\right),$$
where $[r=-1] = 1$ if $r = -1$ and $[r=-1] = 0$ otherwise, and $\theta_a$ is the vector of topics trained for document $a$ in the LDA model, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$. The sampling probabilities for each $z$ variable now look like
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a^{w \to t}) = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{\sum_D \log\left|[r=-1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_a^{w \to t}\right)\right|},$$
where $\hat r^{\mathrm{SVD}}_{i,a} = \mu + b_i + b_a + q_a^\top p_i$, and $\theta_a^{w \to t}$ is the vector of topics for document $a$ with topic $t$ substituted in place of $z_w$.

We see that in the formula above, to compute the sampling distribution for a single $z_w$ variable one has to take a sum over all ratings all users have provided for this document, and due to the presence of the sigmoid function one cannot cancel out terms and reduce the SVD sum to updating counts. It is possible to store precomputed values of $\hat r^{\mathrm{SVD}}_{i,a}$ in memory, but it does not help because the $z_w$ variables change during sampling, and when they do, all values of $\sigma(\hat r^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_a^{w \to t})$ also have to be recomputed for each rating in the database.

To make the model feasible, a simplified SVD-LDA training algorithm was developed in [1] that could run reasonably fast on large datasets. It uses a first-order approximation to the log-likelihood based on its Taylor series at zero:
$$\frac{\partial \log p(D \mid l_i, \theta_a, \ldots)}{\partial \theta_a} = \sum_D \left([r=1] \left(1 - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i - [r=-1]\, \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right) l_i\right) = \sum_D \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i. \qquad (1)$$
We denote $s_a = \sum_D \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$. We can now precompute $s_a$ (a vector over topics) for each document right after SVD training (with additional memory of the same size as the $\theta$ matrix) and use it in LDA sampling:
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a^{w \to t}) \approx \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_a^\top \theta_a^{w \to t}},$$
and the latter is proportional to simply
$$\frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_{a,t}/N_a},$$
because $s_a^\top \theta_a^{w \to t} = s_a^\top \theta_a - \frac{1}{N_a} s_{a,z_w} + \frac{1}{N_a} s_{a,t}$, and the first two terms do not depend on the topic $t$ that is being sampled. Thus, the first-order approximation yields a simple modification of LDA sampling that incurs relatively small computational overhead compared to the sampling itself.
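A minimal sketch of this trick follows (our own names and data layout, not the implementation from [1]): after the SVD step we compute $s_a$ for every document, and the sampler then only needs the $t$-th component of $s_a$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def precompute_s(ratings, r_hat_svd, l_user, theta, num_topics):
    """s_a = sum over ratings (i, a, r) of ([r = 1] - sigma(r_hat_svd + theta_a^T l_i)) * l_i.

    ratings   -- list of (user i, document a, r) with r in {-1, +1},
    r_hat_svd -- dict (i, a) -> mu + b_i + b_a + q_a^T p_i,
    l_user    -- dict i -> topic-preference vector l_i (possibly shared per cluster),
    theta     -- dict a -> current topic vector theta_a of document a.
    """
    s = {}
    for i, a, r in ratings:
        p_like = sigmoid(r_hat_svd[(i, a)] + theta[a] @ l_user[i])
        s.setdefault(a, np.zeros(num_topics))
        s[a] += ((1.0 if r == 1 else 0.0) - p_like) * l_user[i]
    return s

def approx_svd_lda_weights(n_dt_minus, n_wt_minus, n_t_minus, alpha, beta, W, s_a, N_a):
    """Approximate SVD-LDA sampling weights: LDA count ratio times exp(s_{a,t} / N_a)."""
    lda_part = (n_dt_minus + alpha) * (n_wt_minus + beta) / (n_t_minus + W * beta)
    weights = lda_part * np.exp(s_a / N_a)
    return weights / weights.sum()
```

Since the vectors $s_a$ are recomputed only after each SVD training pass, a full Gibbs sweep costs essentially the same as in plain LDA.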
Model        Topics  Features  λ      NDCG    AUC     MAP     WTA     Top3    Top5
SVD             n/a         5  0.1    0.9814  0.8794  0.9406  0.9440  0.9434  0.9424
SVD             n/a        10  0.15   0.9815  0.8801  0.9405  0.9448  0.9434  0.9425
SVD             n/a        15  0.2    0.9815  0.8802  0.9405  0.9453  0.9435  0.9426
SVD             n/a        20  0.2    0.9816  0.8803  0.9406  0.9453  0.9437  0.9427
SVD-LDA          50         5  0.025  0.9829  0.8893  0.9418  0.9499  0.9466  0.9445
SVD-LDA         100        10  0.025  0.9829  0.8893  0.9418  0.9500  0.9465  0.9445
SVD-LDA         200        15  0.01   0.9830  0.8895  0.9417  0.9524  0.9470  0.9446
SVD-LDA-DEM      50        10  0.01   0.9840  0.8901  0.9428  0.9531  0.9481  0.9456
SVD-LDA-DEM     100         5  0.01   0.9840  0.8904  0.9428  0.9528  0.9480  0.9456
SVD-LDA-DEM     200        10  0.01   0.9840  0.8898  0.9428  0.9524  0.9481  0.9456

Table 1: Ranking metrics on the test set. Only the best results w.r.t. λ and the number of features are shown [1].

We have outlined a general approximate sampling scheme; several different variations are possible depending on which predictors are shared in the basic SVD model $p(\mathrm{success}_{i,a}) = \sigma(\hat r_{i,a})$. In general, a separate set of $l_i$ features for every user would lead to heavy overfitting, so we used two variations: either share $l_i = l$ among all users or share $l_i = l_c$ among certain clusters of users, preferably inferred from external information, e.g., demographic features. Both variations can be used for cold start with respect to users. Table 1 summarizes the results of experiments showing that SVD-LDA does indeed improve upon the basic SVD model [1].

3 SVD-ARTM

In recent works [2–4], K. Vorontsov and coauthors demonstrated that if one adds regularizers to the objective function at the training stage of the basic probabilistic latent semantic analysis (pLSA) model, which actually predates LDA [15], one can impose a very wide variety of constraints on the resulting topic model. This approach has been called Additive Regularization of Topic Models (ARTM). In particular, the authors showed that one can formulate a regularizer that imposes constraints on the smoothness of topic-document and word-topic distributions, which corresponds to the Bayesian approach expressed in LDA (i.e., it smooths out the distributions).

Formally speaking, for a set of regularizers $R_i(\Phi, \Theta)$, $i = 1..r$, and regularization weights $\rho_i$, $i = 1..r$, we extend the objective function to maximize
$$L(\Phi, \Theta) + R(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in W} n_{dw} \log p(w \mid d) + \sum_{i=1}^{r} \rho_i R_i(\Phi, \Theta).$$
By the Karush–Kuhn–Tucker conditions, any solution of the resulting problem satisfies the following system of equations:
$$p_{tdw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{td}\right), \qquad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}, \qquad n_{td} = \sum_{w \in d} n_{dw} p_{tdw},$$
$$\phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\right), \qquad \theta_{td} = \mathrm{norm}^{+}_{t \in T}\left(n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\right),$$
where $\mathrm{norm}^{+}$ denotes non-negative normalization: $\mathrm{norm}^{+}_{a \in A}\, x_a = \frac{\max\{x_a, 0\}}{\sum_{b \in A} \max\{x_b, 0\}}$. This system of equations yields a natural iterative algorithm (a simple-iteration, EM-like procedure) for finding the parameters $\phi_{wt}$ and $\theta_{td}$, which for $R = 0$ reduces to EM inference in pLSA; see [3] for a full derivation and a more detailed treatment. Thus, we have a model that is very easy to extend and computationally cheaper to train than the LDA model, especially than LDA extensions that rely on Gibbs sampling.
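As a concrete illustration, here is a minimal sketch of one such regularized EM-style iteration (our own illustration with dense NumPy arrays and a callback interface for the regularizer gradients; it is not the BigARTM implementation).

```python
import numpy as np

def norm_plus(x, axis):
    """Non-negative normalization: clip negative entries at zero, then normalize along `axis`."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def artm_iteration(n_dw, phi, theta, dR_dphi, dR_dtheta):
    """One ARTM iteration for a count matrix n_dw (D x W), phi (W x T), theta (T x D).

    dR_dphi(phi, theta)   -> array of dR/dphi_wt,   shape (W, T)
    dR_dtheta(phi, theta) -> array of dR/dtheta_td, shape (T, D)
    """
    D, W = n_dw.shape
    T = phi.shape[1]
    n_wt = np.zeros((W, T))
    n_td = np.zeros((T, D))
    for d in range(D):
        words = np.nonzero(n_dw[d])[0]
        # E-step: p_tdw = norm^+_t(phi_wt * theta_td) for every word of document d
        p_tdw = norm_plus(phi[words] * theta[:, d], axis=1)     # (#words, T)
        counts = n_dw[d, words, None] * p_tdw                   # n_dw * p_tdw
        n_wt[words] += counts
        n_td[:, d] += counts.sum(axis=0)
    # M-step: regularized updates, phi normalized over words, theta over topics
    phi_new = norm_plus(n_wt + phi * dR_dphi(phi, theta), axis=0)
    theta_new = norm_plus(n_td + theta * dR_dtheta(phi, theta), axis=0)
    return phi_new, theta_new
```

With both callbacks returning zeros, this reduces to the EM algorithm for plain pLSA.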
To extend ARTM with an SVD-based regularizer, we begin with a regularizer of the same form as in Section 2: the total likelihood of the dataset of ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihoods of individual ratings, so its logarithm is
$$R(\Phi, \Theta) = \log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r=1] \log \sigma\left(\hat r_{i,a}\right) + [r=-1] \log\left(1 - \sigma\left(\hat r_{i,a}\right)\right)\right),$$
where $[r=-1] = 1$ if $r = -1$ and $[r=-1] = 0$ otherwise, $\theta_a$ is the vector of topic probabilities of document $a$ in the topic model (in the LDA case, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$), and
$$\hat r_{i,a} = \hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i = \mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i.$$
To add this regularizer to the pLSA model, we have to compute its partial derivatives with respect to the parameters:
$$\frac{\partial R(\Phi, \Theta)}{\partial \phi_{wt}} = 0, \qquad \frac{\partial R(\Phi, \Theta)}{\partial \theta_{ta}} = \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t};$$
note that the latter equality is exactly the same as (1) (hence we omit the derivation), only now it is a direct part of the algorithm rather than a first-order approximation to the sampling. The final algorithm is, thus, to iterate the following:
$$p_{taw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{ta}\right), \qquad n_{wt} = \sum_{a \in D} n_{aw} p_{taw}, \qquad n_{ta} = \sum_{w \in a} n_{aw} p_{taw},$$
$$\phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt}\right), \qquad \theta_{ta} = \mathrm{norm}^{+}_{t \in T}\left(n_{ta} + \rho\, \theta_{ta} \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t}\right).$$
Similar to SVD-LDA, we can precompute $s_a = \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$ (a vector over topics) for each document after the SVD part is trained and use it throughout a pLSA iteration, so that the $\theta$ update becomes $\theta_{ta} = \mathrm{norm}^{+}_{t \in T}\left(n_{ta} + \rho\, \theta_{ta} s_{a,t}\right)$.
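A minimal sketch of this regularizer in the callback interface used above (our own illustration; the document indexing, dictionary layout, and function names are assumptions, not BigARTM's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def svd_regularizer(ratings, r_hat_svd, l_user, rho):
    """Build the gradient callbacks for the SVD-based regularizer.

    ratings   -- list of (user i, document index a, r) with r in {-1, +1},
    r_hat_svd -- dict (i, a) -> mu + b_i + b_a + q_a^T p_i, fixed after SVD training,
    l_user    -- dict i -> topic-preference vector l_i (possibly shared per cluster),
    rho       -- regularization weight.
    """
    def dR_dtheta(phi, theta):                  # theta has shape (T, D)
        grad = np.zeros_like(theta)             # columns accumulate the vectors s_a
        for i, a, r in ratings:
            p_like = sigmoid(r_hat_svd[(i, a)] + theta[:, a] @ l_user[i])
            grad[:, a] += ((1.0 if r == 1 else 0.0) - p_like) * l_user[i]
        return rho * grad

    def dR_dphi(phi, theta):                    # the regularizer does not depend on phi
        return np.zeros_like(phi)

    return dR_dphi, dR_dtheta
```

Because $\hat r^{\mathrm{SVD}}_{i,a}$ is fixed within a pLSA pass, the gradient (the matrix whose columns are the vectors $s_a$) is computed once per iteration and reused for all $\theta$ updates, mirroring the precomputation trick of SVD-LDA.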
4 Conclusion

In this work, we have developed an ARTM regularizer that adds an SVD-based matrix decomposition model on top of the ARTM topic model. We have shown that the resulting inference algorithm closely matches the inference algorithm developed in the SVD-LDA modification of LDA with a first-order approximation to Gibbs sampling. In further work, we plan to implement this regularizer and incorporate it into the BigARTM library [2, 3].

Acknowledgements. This work was supported by the Russian Science Foundation grant no. 15-11-10019.

References

1. Nikolenko, S.I.: SVD-LDA: Topic modeling for full-text recommender systems. In: Proc. 14th Mexican International Conference on Artificial Intelligence. LNAI, vol. 9414, Springer (2015) 67–79
2. Vorontsov, K.: Additive regularization for topic models of text collections. Doklady Mathematics 89 (2014) 301–304
3. Potapenko, A., Vorontsov, K.: Robust pLSA performs better than LDA. In: Proc. 35th European Conference on IR Research. LNCS, vol. 7814, Springer (2013) 784–787
4. Vorontsov, K., Frei, O., Apishev, M., Romov, P., Suvorova, M., Yanina, A.: Non-Bayesian additive regularization for multimodal topic modeling of large collections. In: Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15), New York, NY, USA, ACM (2015) 29–37
5. Sokolov, E., Bogolubsky, L.: Topic models regularization and initialization for regression problems. In: Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15), New York, NY, USA, ACM (2015) 21–27
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993–1022
7. Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101 (Suppl. 1) (2004) 5228–5235
8. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2006) 424–433
9. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proc. of the 23rd International Conference on Machine Learning, New York, NY, USA, ACM (2006) 113–120
10. Wang, C., Blei, D.M., Heckerman, D.: Continuous time dynamic topic models. In: Proc. of the 24th Conference on Uncertainty in Artificial Intelligence (2008)
11. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 20 (2008)
12. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proc. of the 20th Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, United States, AUAI Press (2004) 487–494
13. Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., Steyvers, M.: Learning author-topic models from text corpora. ACM Transactions on Information Systems 28 (2010) 1–38
14. Blei, D.M., McAuliffe, J.D.: Supervised topic models. Advances in Neural Information Processing Systems 22 (2007)
15. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42 (2001) 177–196