ARTM vs. LDA: an SVD Extension Case Study

Sergey Nikolenko 1,2,3,4

1 Laboratory for Internet Studies, NRU Higher School of Economics
2 Steklov Institute of Mathematics at St. Petersburg, Russia
3 Kazan (Volga Region) Federal University, Kazan, Russia
4 Deloitte Analytics Institute, Moscow, Russia
sergey@logic.pdmi.ras.ru

Abstract. In this work, we compare two extensions of two different topic models applied to the same problem of recommending full-text items: the previously developed SVD-LDA and its counterpart SVD-ARTM based on additive regularization. We show that ARTM naturally leads to the inference algorithm that has to be painstakingly developed for LDA.

1 Introduction

Topic models are an important part of the natural language processing landscape, providing unsupervised ways to quickly evaluate what a whole corpus of texts is about and to classify the texts into well-defined topics. LDA extensions provide ways to augment basic topic models with additional information and retool them to serve other purposes. In previous work, we combined the SVD and LDA decompositions into a single unified model that optimizes the joint likelihood function and thus infers topics that are especially useful for improving recommendations. We provided an inference algorithm based on Gibbs sampling, developing an approximate sampling scheme based on a first-order approximation to Gibbs sampling [1]. The recently developed ARTM approach [2–5] extends the basic pLSA model with regularizers and provides a unified way to add new additive regularizers; inference algorithms follow by simple differentiation of the regularizers. In this work, we apply ARTM to the problem of adding an SVD decomposition to a topic model; we show that one automatically arrives at an inference algorithm very similar to our previous SVD-LDA approach.

2 LDA and SVD-LDA

The graphical model of LDA [6, 7] is shown in Fig. 1a. We assume that a corpus of $D$ documents contains $T$ topics expressed by $W$ different words. Each document $d \in D$ is modeled as a discrete distribution $\theta^{(d)}$ on the set of topics: $p(z_w = t) = \theta^{(d)}_t$, where $z_w$ is a discrete variable that defines the topic of each word $w \in d$. Each topic, in turn, corresponds to a multinomial distribution on words: $p(w \mid z_w = t) = \phi^{(t)}_w$. The model also introduces prior Dirichlet distributions with parameters $\alpha$ for the topic vectors $\theta$, $\theta \sim \mathrm{Dir}(\alpha)$, and $\beta$ for the word distributions $\phi$, $\phi \sim \mathrm{Dir}(\beta)$.

Fig. 1: The (a) LDA and (b) sLDA graphical models.

A document is generated word by word: for each word, we (1) sample a topic index $t$ from the distribution $\theta^{(d)}$ and (2) sample the word $w$ from the distribution $\phi^{(t)}$. Inference in LDA is usually done via either variational approximations or Gibbs sampling; we use the latter since it is easier to generalize to further extensions. In the basic LDA model, Gibbs sampling reduces to so-called collapsed Gibbs sampling, where the $\theta$ and $\phi$ variables are integrated out and the $z_w$ are iteratively resampled according to the distribution
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)},$$
where $n^{(d)}_{-w,t}$ is the number of words in document $d$ assigned to topic $t$ and $n^{(w)}_{-w,t}$ is the number of times word $w$ has been generated from topic $t$, in both cases excluding the current value of $z_w$; both counters depend on the other variables $\mathbf{z}_{-w}$. Samples are then used to estimate the model variables:
$$\theta_{d,t} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)}, \qquad \phi_{w,t} = \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)},$$
where $\phi_{w,t}$ denotes the probability to draw word $w$ in topic $t$ and $\theta_{d,t}$ is the probability to draw topic $t$ for a word in document $d$.
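To make the collapsed Gibbs update concrete, here is a minimal sketch of one resampling step (this is our own illustration, not code from [1]; the counter names and function signature are assumptions). It keeps document-topic, word-topic, and topic-total counters and resamples a single token; the document-side denominator is dropped because it does not depend on the topic being sampled.

```python
import numpy as np

def resample_topic(d, w, t_old, n_dt, n_wt, n_t, alpha, beta, W):
    """One collapsed Gibbs step for the token of word w in document d.

    n_dt[d, t] -- number of words in document d assigned to topic t,
    n_wt[w, t] -- number of times word w was generated from topic t,
    n_t[t]     -- total number of words assigned to topic t.
    """
    # remove the current assignment to obtain the "-w" counters
    n_dt[d, t_old] -= 1
    n_wt[w, t_old] -= 1
    n_t[t_old] -= 1

    # unnormalized sampling weights over topics
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + W * beta)
    t_new = np.random.choice(len(p), p=p / p.sum())

    # record the new assignment
    n_dt[d, t_new] += 1
    n_wt[w, t_new] += 1
    n_t[t_new] += 1
    return t_new
```

Sweeping this step over all tokens for a number of iterations and averaging the counters yields the estimates of $\theta$ and $\phi$ given above.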
The basic LDA model has been used and extended in numerous applications; the class of extensions most relevant here takes into account additional information that may be available together with the documents and may reveal additional insights into their topical structure. For instance, the Topics over Time model and dynamic topic models apply when documents have creation timestamps (e.g., news articles or blog posts) [8–10], DiscLDA assumes that each document is assigned a categorical label and attempts to use LDA for mining topic classes related to classification [11], the Author-Topic model incorporates information about the authors of a document [12, 13], and so on.

The SVD-LDA model, presented in [1], can be regarded as an extension of the supervised LDA (sLDA) model [14]. The sLDA graphical model is shown in Fig. 1b. In sLDA, each document is augmented with a response variable $y$ drawn from a normal distribution centered around a linear combination of the document's topical distribution ($\bar z$, the average of the $z$ variables in this document) with unknown parameters $b$, $a$ that are also to be trained: $y \sim \mathcal{N}(y \mid b^\top \bar z + a, \sigma^2)$. The original work [14] presents an inference algorithm for sLDA based on variational approximations, but here we work with Gibbs sampling, which will be easier to extend to SVD-LDA later. Thus, we show an sLDA Gibbs sampling scheme. It differs from the original LDA in that the model likelihood gets another factor corresponding to the $y$ variable, $p(y_d \mid \mathbf{z}, b, \sigma^2) \propto \exp\left(-(y_d - b^\top \bar z_d - a)^2/2\right)$, and the total likelihood is now
$$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d e^{-(y_d - b^\top \bar z_d - a)^2/2}.$$
On each iteration of the sampling algorithm, we now first sample $\mathbf{z}$ for fixed $b$ and then train $b$ for fixed (sampled) $\mathbf{z}$. The sampling distributions for each $z$ variable, according to the equation above, are
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, e^{-\frac{1}{2}(y_d - b^\top \bar z - a)^2} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{-\frac{1}{2}(y_d - b^\top \bar z - a)^2}.$$
The latter equation can be either used directly or further transformed by separating $\mathbf{z}_{-w}$ explicitly.

SVD-LDA considers a recommender system based on likes and dislikes, so it uses the logistic sigmoid $\sigma(x) = 1/(1 + e^{-x})$ of a linear function to model the probability of a "like": $p(\mathrm{success}_{i,a}) = \sigma\left(b^\top \bar z + a\right)$. In this version of sLDA, the graphical model remains the same; only the conditional probabilities change. The total likelihood is now
$$p(\mathbf{z} \mid \mathbf{w}, \mathbf{y}, b, \alpha, \beta, \sigma^2) \propto \prod_d \frac{B(n_d + \alpha)}{B(\alpha)} \prod_t \frac{B(n_t + \beta)}{B(\beta)} \prod_d \prod_{x \in X_d} \sigma\left(b^\top \bar z_d + a\right)^{y_x} \left(1 - \sigma\left(b^\top \bar z_d + a\right)\right)^{1 - y_x},$$
where $X_d$ is the set of experiments (ratings) for document $d$, and $y_x$ is the binary result of one such experiment. The sampling procedure also remains the same, except that we now train logistic regression with respect to $b$, $a$ for fixed $\mathbf{z}$ instead of linear regression, and the sampling probabilities for each $z$ variable are now
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta) \prod_{x \in X_d} \left[\sigma\left(b^\top \bar z_d + a\right)\right]^{y_x} \left[1 - \sigma\left(b^\top \bar z_d + a\right)\right]^{1 - y_x} = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_d \log p_d + (|X_d| - s_d) \log(1 - p_d)},$$
where $s_d$ is the number of successful experiments among $X_d$ and $p_d = \frac{1}{1 + e^{-b^\top \bar z_d - a}}$.
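The following minimal sketch (our own names and interface, not code from [1]) computes the per-topic sampling weights for one token in this like/dislike version of sLDA: the LDA count ratio is multiplied by the likelihood of the document's ratings under each candidate topic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_slda_weights(n_dt_minus, n_wt_minus, n_t_minus, alpha, beta, W,
                          z_bar_minus, N_d, b, a, s_d, num_ratings):
    """Per-topic sampling weights for one token of document d.

    n_dt_minus, n_wt_minus, n_t_minus -- counters with the current token removed,
    z_bar_minus -- sum of topic indicators of the other tokens of d, divided by N_d,
    s_d         -- number of "likes" among the document's ratings, num_ratings = |X_d|.
    """
    T = len(n_t_minus)
    lda_part = (n_dt_minus + alpha) * (n_wt_minus + beta) / (n_t_minus + W * beta)
    weights = np.empty(T)
    for t in range(T):
        z_bar = z_bar_minus.copy()
        z_bar[t] += 1.0 / N_d                  # reassign the current token to topic t
        p_d = sigmoid(b @ z_bar + a)           # probability of a "like"
        weights[t] = lda_part[t] * p_d ** s_d * (1.0 - p_d) ** (num_ratings - s_d)
    return weights / weights.sum()
```

The inner loop over topics makes explicit why this step is expensive: the sigmoid has to be re-evaluated for every candidate topic, which is the cost that the first-order approximation introduced below is designed to avoid.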
The SVD-LDA extension was introduced in [1] as follows: for recommendations we use an SVD model with additional predictors corresponding to how much a certain user or group of users likes the topics trained in the LDA model; since the dataset is binary (like-dislike), we use a logistic version of the SVD model,
$$p(\mathrm{success}_{i,a}) = \sigma\left(\hat r_{i,a}\right) = \sigma\left(\mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i\right),$$
where $p_i$ may be absent in case of cold start, and $l_i$ may be shared among groups (clusters) of users. The total likelihood of the dataset of ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihoods of individual ratings (assuming, as usual, that they are independent):
$$p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \prod_D \sigma\left(\hat r_{i,a}\right)^{[r=1]} \left(1 - \sigma\left(\hat r_{i,a}\right)\right)^{[r=-1]},$$
and its logarithm is
$$\log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r=1] \log \sigma\left(\hat r_{i,a}\right) + [r=-1] \log\left(1 - \sigma\left(\hat r_{i,a}\right)\right)\right),$$
where $[r=-1] = 1$ if $r = -1$ and $[r=-1] = 0$ otherwise, and $\theta_a$ is the vector of topics trained for document $a$ in the LDA model, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$. The sampling probabilities for each $z$ variable now look like
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a^{w \to t}) = \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{\sum_D \log\left|[r=-1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_a^{w \to t}\right)\right|},$$
where $\hat r^{\mathrm{SVD}}_{i,a} = \mu + b_i + b_a + q_a^\top p_i$, and $\theta_a^{w \to t}$ is the vector of topics for document $a$ with topic $t$ substituted in place of $z_w$.

We see that in the formula above, to compute the sampling distribution for a single $z_w$ variable one has to take a sum over all ratings all users have provided for this document, and due to the presence of the sigmoid function one cannot cancel out terms and reduce the SVD sum to updating counts. It is possible to store precomputed values of $\hat r^{\mathrm{SVD}}_{i,a}$ in memory, but it does not help because the $z_w$ variables change during sampling, and when they do, all values of $\sigma(\hat r^{\mathrm{SVD}}_{i,a} + l_i^\top \theta_a^{w \to t})$ also have to be recomputed for each rating in the database.

To make the model feasible, a simplified SVD-LDA training algorithm was developed in [1] that could run reasonably fast on large datasets. It uses a first-order approximation to the log-likelihood based on its Taylor series at zero:
$$\frac{\partial \log p(D \mid l_i, \theta_a, \ldots)}{\partial \theta_a} = \sum_D \left([r=1] \left(1 - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i - [r=-1]\, \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right) l_i\right) = \sum_D \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i. \qquad (1)$$
We denote $s_a = \sum_D \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$. We can now precompute $s_a$ (a vector over topics) for each document right after SVD training (with additional memory of the same size as the $\theta$ matrix) and use it in LDA sampling:
$$p(z_w = t \mid \mathbf{z}_{-w}, w, \alpha, \beta) \propto q(z_w, t, \mathbf{z}_{-w}, w, \alpha, \beta)\, p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a^{w \to t}) \approx \frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_a^\top \theta_a^{w \to t}},$$
and the latter is proportional to simply
$$\frac{n^{(d)}_{-w,t} + \alpha}{\sum_{t' \in T}\left(n^{(d)}_{-w,t'} + \alpha\right)} \cdot \frac{n^{(w)}_{-w,t} + \beta}{\sum_{w' \in W}\left(n^{(w')}_{-w,t} + \beta\right)}\, e^{s_{a,t}/N_a},$$
because $s_a^\top \theta_a^{w \to t} = s_a^\top \theta_a - \frac{1}{N_a} s_{a,z_w} + \frac{1}{N_a} s_{a,t}$, and the first two terms do not depend on the topic $t$ that is being sampled. Thus, the first-order approximation yields a simple modification of LDA sampling that incurs relatively small computational overhead compared to the sampling itself.
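A minimal sketch of this trick follows (our own names and data layout, not the implementation from [1]): after the SVD step we compute $s_a$ for every document, and the sampler then only needs the $t$-th component of $s_a$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def precompute_s(ratings, r_hat_svd, l_user, theta, num_topics):
    """s_a = sum over ratings (i, a, r) of ([r = 1] - sigma(r_hat_svd + theta_a^T l_i)) * l_i.

    ratings   -- list of (user i, document a, r) with r in {-1, +1},
    r_hat_svd -- dict (i, a) -> mu + b_i + b_a + q_a^T p_i,
    l_user    -- dict i -> topic-preference vector l_i (possibly shared per cluster),
    theta     -- dict a -> current topic vector theta_a of document a.
    """
    s = {}
    for i, a, r in ratings:
        p_like = sigmoid(r_hat_svd[(i, a)] + theta[a] @ l_user[i])
        s.setdefault(a, np.zeros(num_topics))
        s[a] += ((1.0 if r == 1 else 0.0) - p_like) * l_user[i]
    return s

def approx_svd_lda_weights(n_dt_minus, n_wt_minus, n_t_minus, alpha, beta, W, s_a, N_a):
    """Approximate SVD-LDA sampling weights: LDA count ratio times exp(s_{a,t} / N_a)."""
    lda_part = (n_dt_minus + alpha) * (n_wt_minus + beta) / (n_t_minus + W * beta)
    weights = lda_part * np.exp(s_a / N_a)
    return weights / weights.sum()
```

Since the vectors $s_a$ are recomputed only after each SVD training pass, a full Gibbs sweep costs essentially the same as in plain LDA.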
Model        Topics  Features  λ      NDCG    AUC     MAP     WTA     Top3    Top5
SVD             n/a         5  0.1    0.9814  0.8794  0.9406  0.9440  0.9434  0.9424
SVD             n/a        10  0.15   0.9815  0.8801  0.9405  0.9448  0.9434  0.9425
SVD             n/a        15  0.2    0.9815  0.8802  0.9405  0.9453  0.9435  0.9426
SVD             n/a        20  0.2    0.9816  0.8803  0.9406  0.9453  0.9437  0.9427
SVD-LDA          50         5  0.025  0.9829  0.8893  0.9418  0.9499  0.9466  0.9445
SVD-LDA         100        10  0.025  0.9829  0.8893  0.9418  0.9500  0.9465  0.9445
SVD-LDA         200        15  0.01   0.9830  0.8895  0.9417  0.9524  0.9470  0.9446
SVD-LDA-DEM      50        10  0.01   0.9840  0.8901  0.9428  0.9531  0.9481  0.9456
SVD-LDA-DEM     100         5  0.01   0.9840  0.8904  0.9428  0.9528  0.9480  0.9456
SVD-LDA-DEM     200        10  0.01   0.9840  0.8898  0.9428  0.9524  0.9481  0.9456

Table 1: Ranking metrics on the test set. Only the best results w.r.t. λ and the number of features are shown [1].

We have outlined a general approximate sampling scheme; several different variations are possible depending on which predictors are shared in the basic SVD model $p(\mathrm{success}_{i,a}) = \sigma(\hat r_{i,a})$. In general, a separate set of $l_i$ features for every user would lead to heavy overfitting, so we used two variations: either share $l_i = l$ among all users or share $l_i = l_c$ among certain clusters of users, preferably inferred from external information, e.g., demographic features. Both variations can be used for cold start with respect to users. Table 1 summarizes the results of experiments showing that SVD-LDA does indeed improve upon the basic SVD model [1].

3 SVD-ARTM

In recent works [2–4], K. Vorontsov and coauthors demonstrated that if one adds regularizers to the objective function at the training stage of the basic probabilistic latent semantic analysis (pLSA) model, which actually predates LDA [15], one can impose a very wide variety of constraints on the resulting topic model. This approach has been called Additive Regularization of Topic Models (ARTM). In particular, the authors showed that one can formulate a regularizer that imposes constraints on the smoothness of topic-document and word-topic distributions, which corresponds to the Bayesian approach expressed in LDA (i.e., it smooths out the distributions).

Formally speaking, for a set of regularizers $R_i(\Phi, \Theta)$, $i = 1..r$, and regularization weights $\rho_i$, $i = 1..r$, we extend the objective function to maximize
$$L(\Phi, \Theta) + R(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in W} n_{dw} \log p(w \mid d) + \sum_{i=1}^{r} \rho_i R_i(\Phi, \Theta).$$
By the Karush–Kuhn–Tucker conditions, any solution of the resulting problem satisfies the following system of equations:
$$p_{tdw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{td}\right), \qquad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}, \qquad n_{td} = \sum_{w \in d} n_{dw} p_{tdw},$$
$$\phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\right), \qquad \theta_{td} = \mathrm{norm}^{+}_{t \in T}\left(n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}}\right),$$
where $\mathrm{norm}^{+}$ denotes non-negative normalization: $\mathrm{norm}^{+}_{a \in A}\, x_a = \frac{\max\{x_a, 0\}}{\sum_{b \in A} \max\{x_b, 0\}}$. This system of equations yields a natural iterative algorithm (a simple-iteration, EM-like procedure) for finding the parameters $\phi_{wt}$ and $\theta_{td}$, which for $R = 0$ reduces to EM inference in pLSA; see [3] for a full derivation and a more detailed treatment. Thus, we have a model that is very easy to extend and computationally cheaper to train than the LDA model, especially than LDA extensions that rely on Gibbs sampling.
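As a concrete illustration, here is a minimal sketch of one such regularized EM-style iteration (our own illustration with dense NumPy arrays and a callback interface for the regularizer gradients; it is not the BigARTM implementation).

```python
import numpy as np

def norm_plus(x, axis):
    """Non-negative normalization: clip negative entries at zero, then normalize along `axis`."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def artm_iteration(n_dw, phi, theta, dR_dphi, dR_dtheta):
    """One ARTM iteration for a count matrix n_dw (D x W), phi (W x T), theta (T x D).

    dR_dphi(phi, theta)   -> array of dR/dphi_wt,   shape (W, T)
    dR_dtheta(phi, theta) -> array of dR/dtheta_td, shape (T, D)
    """
    D, W = n_dw.shape
    T = phi.shape[1]
    n_wt = np.zeros((W, T))
    n_td = np.zeros((T, D))
    for d in range(D):
        words = np.nonzero(n_dw[d])[0]
        # E-step: p_tdw = norm^+_t(phi_wt * theta_td) for every word of document d
        p_tdw = norm_plus(phi[words] * theta[:, d], axis=1)     # (#words, T)
        counts = n_dw[d, words, None] * p_tdw                   # n_dw * p_tdw
        n_wt[words] += counts
        n_td[:, d] += counts.sum(axis=0)
    # M-step: regularized updates, phi normalized over words, theta over topics
    phi_new = norm_plus(n_wt + phi * dR_dphi(phi, theta), axis=0)
    theta_new = norm_plus(n_td + theta * dR_dtheta(phi, theta), axis=0)
    return phi_new, theta_new
```

With both callbacks returning zeros, this reduces to the EM algorithm for plain pLSA.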
To extend ARTM with an SVD-based regularizer, we begin with a regularizer of the same form as in Section 2: the total likelihood of the dataset of ratings comprised of triples $D = \{(i, a, r)\}$ (user $i$ rated item $a$ as $r \in \{-1, 1\}$) is a product of the likelihoods of individual ratings, so its logarithm is
$$R(\Phi, \Theta) = \log p(D \mid \mu, b_i, b_a, p_i, q_a, l_i, \theta_a) = \sum_D \left([r=1] \log \sigma\left(\hat r_{i,a}\right) + [r=-1] \log\left(1 - \sigma\left(\hat r_{i,a}\right)\right)\right),$$
where $[r=-1] = 1$ if $r = -1$ and $[r=-1] = 0$ otherwise, $\theta_a$ is the vector of topic probabilities of document $a$ in the topic model (in the LDA case, $\theta_a = \frac{1}{N_a} \sum_{w \in a} z_w$, where $N_a$ is the length of document $a$), and
$$\hat r_{i,a} = \hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i = \mu + b_i + b_a + q_a^\top p_i + \theta_a^\top l_i.$$
To add this regularizer to the pLSA model, we have to compute its partial derivatives with respect to the parameters:
$$\frac{\partial R(\Phi, \Theta)}{\partial \phi_{wt}} = 0, \qquad \frac{\partial R(\Phi, \Theta)}{\partial \theta_{ta}} = \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t};$$
note that the latter equality is exactly the same as (1) (hence we omit the derivation), only now it is a direct part of the algorithm rather than a first-order approximation to the sampling. The final algorithm is, thus, to iterate the following:
$$p_{taw} = \mathrm{norm}^{+}_{t \in T}\left(\phi_{wt} \theta_{ta}\right), \qquad n_{wt} = \sum_{a \in D} n_{aw} p_{taw}, \qquad n_{ta} = \sum_{w \in a} n_{aw} p_{taw},$$
$$\phi_{wt} = \mathrm{norm}^{+}_{w \in W}\left(n_{wt}\right), \qquad \theta_{ta} = \mathrm{norm}^{+}_{t \in T}\left(n_{ta} + \rho\, \theta_{ta} \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_{i,t}\right).$$
Similar to SVD-LDA, we can precompute $s_a = \sum_{(i,a,r) \in D} \left([r=1] - \sigma\left(\hat r^{\mathrm{SVD}}_{i,a} + \theta_a^\top l_i\right)\right) l_i$ (a vector over topics) for each document after the SVD part is trained and use it throughout a pLSA iteration, so that the $\theta$ update becomes $\theta_{ta} = \mathrm{norm}^{+}_{t \in T}\left(n_{ta} + \rho\, \theta_{ta} s_{a,t}\right)$.
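A minimal sketch of this regularizer in the callback interface used above (our own illustration; the document indexing, dictionary layout, and function names are assumptions, not BigARTM's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def svd_regularizer(ratings, r_hat_svd, l_user, rho):
    """Build the gradient callbacks for the SVD-based regularizer.

    ratings   -- list of (user i, document index a, r) with r in {-1, +1},
    r_hat_svd -- dict (i, a) -> mu + b_i + b_a + q_a^T p_i, fixed after SVD training,
    l_user    -- dict i -> topic-preference vector l_i (possibly shared per cluster),
    rho       -- regularization weight.
    """
    def dR_dtheta(phi, theta):                  # theta has shape (T, D)
        grad = np.zeros_like(theta)             # columns accumulate the vectors s_a
        for i, a, r in ratings:
            p_like = sigmoid(r_hat_svd[(i, a)] + theta[:, a] @ l_user[i])
            grad[:, a] += ((1.0 if r == 1 else 0.0) - p_like) * l_user[i]
        return rho * grad

    def dR_dphi(phi, theta):                    # the regularizer does not depend on phi
        return np.zeros_like(phi)

    return dR_dphi, dR_dtheta
```

Because $\hat r^{\mathrm{SVD}}_{i,a}$ is fixed within a pLSA pass, the gradient (the matrix whose columns are the vectors $s_a$) is computed once per iteration and reused for all $\theta$ updates, mirroring the precomputation trick of SVD-LDA.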
4 Conclusion

In this work, we have developed an ARTM regularizer that adds an SVD-based matrix decomposition model on top of the ARTM topic model. We have shown that the resulting inference algorithm closely matches the inference algorithm developed in the SVD-LDA modification of LDA with a first-order approximation to Gibbs sampling. In further work, we plan to implement this regularizer and incorporate it into the BigARTM library [2, 3].

Acknowledgements. This work was supported by the Russian Science Foundation grant no. 15-11-10019.

References

1. Nikolenko, S.I.: SVD-LDA: Topic modeling for full-text recommender systems. In: Proc. 14th Mexican International Conference on Artificial Intelligence. LNAI, vol. 9414, Springer (2015) 67–79
2. Vorontsov, K.: Additive regularization for topic models of text collections. Doklady Mathematics 89 (2014) 301–304
3. Potapenko, A., Vorontsov, K.: Robust pLSA performs better than LDA. In: Proc. 35th European Conference on IR Research. LNCS, vol. 7814, Springer (2013) 784–787
4. Vorontsov, K., Frei, O., Apishev, M., Romov, P., Suvorova, M., Yanina, A.: Non-Bayesian additive regularization for multimodal topic modeling of large collections. In: Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15), New York, NY, USA, ACM (2015) 29–37
5. Sokolov, E., Bogolubsky, L.: Topic models regularization and initialization for regression problems. In: Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications (TM '15), New York, NY, USA, ACM (2015) 21–27
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993–1022
7. Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101 (Suppl. 1) (2004) 5228–5235
8. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2006) 424–433
9. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proc. of the 23rd International Conference on Machine Learning, New York, NY, USA, ACM (2006) 113–120
10. Wang, C., Blei, D.M., Heckerman, D.: Continuous time dynamic topic models. In: Proc. of the 24th Conference on Uncertainty in Artificial Intelligence (2008)
11. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 20 (2008)
12. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proc. of the 20th Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, United States, AUAI Press (2004) 487–494
13. Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., Steyvers, M.: Learning author-topic models from text corpora. ACM Transactions on Information Systems 28 (2010) 1–38
14. Blei, D.M., McAuliffe, J.D.: Supervised topic models. Advances in Neural Information Processing Systems 22 (2007)
15. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42 (2001) 177–196