                                                 Faster PLSA

                                    Valeriy Avanesov                    Ilya Kozlov
                   The Institute for System Programming (ISP) of the Russian Academy of Sciences
                                       kozlov-ilya@ispras.ru, avanesov@ispras.ru



                        Abstract

      Probabilistic Latent Semantic Analysis (PLSA) is an
      effective technique for information retrieval, but it has
      a serious drawback: it consumes a huge amount of
      computational resources, so it is hard to train the model
      on a large collection of documents. The aim of this paper
      is to improve the time efficiency of the training
      algorithm. Two different approaches are explored: one is
      based on efficiently finding an appropriate initial
      approximation; the idea of the other is that, for most
      collections, the topics may be extracted from a relatively
      small fraction of the data.

1     Introduction

Topic modeling is an application of machine learning to text
analysis. It is useful for many text analysis tasks, for example
document categorization [1], spam detection [2] and phishing
detection [3]. One of the widely used algorithms is Probabilistic
Latent Semantic Analysis (PLSA), suggested by Thomas Hofmann
in [4].

1.1    Generative model

PLSA is based on the "bag of words" generative model: every
document is assumed to be a multinomial distribution over topics,
and every topic is a multinomial distribution over words. The
generative model may be defined as follows:

    • For every position in document d, choose a topic t i.i.d.
      from the document's distribution over topics.

    • Choose a word w from topic t.

The aim of topic modeling is to recover the topics and the
distribution of each document over topics.
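As an illustration of this generative process, here is a minimal
Python sketch (the function name and the array-based representation
are illustrative assumptions, not part of the original paper):

```python
import numpy as np

def generate_document(theta_d, phi, n_words, seed=0):
    """Sample a bag-of-words document of n_words tokens.

    theta_d : array of shape (T,), the document's distribution over topics
    phi     : array of shape (W, T), phi[w, t] = p(w | t)
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(n_words):
        t = rng.choice(len(theta_d), p=theta_d)    # topic for this position
        w = rng.choice(phi.shape[0], p=phi[:, t])  # word drawn from that topic
        words.append(w)
    return words
```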

1.2    Topic modeling as an optimization problem

According to the generative model, the probability of observing
the collection D is

        p(D) = \prod_{d \in D} \prod_{w \in d} \sum_{t} p(t|d)\, p(w|t).        (1)

Denote \phi_{wt} = p(w|t) and \theta_{td} = p(t|d). One may obtain
\phi_{wt} and \theta_{td} as the solution of the optimization problem

        L = \sum_{d \in D} \sum_{w \in d} \log \sum_{t} \phi_{wt}\,\theta_{td} \to \max        (2)

subject to the constraints

        \forall t \;\; \sum_{w} \phi_{wt} = 1, \qquad \forall d \;\; \sum_{t} \theta_{td} = 1        (3)

and

        \forall t, w \;\; \phi_{wt} \ge 0, \qquad \forall d, t \;\; \theta_{td} \ge 0.        (4)

1.3    Topic modeling as matrix decomposition

1.3.1   Kullback-Leibler divergence

The Kullback-Leibler divergence is a non-negative measure of the
difference between two probability distributions:

        KL(p \,\|\, q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}.        (5)

Consider an empirical distribution \hat{p}_i and some parametric
distribution q_i = q_i(\alpha) which is used to explain \hat{p}_i.
It is easy to see that in this case minimization of the KL
divergence is equivalent to estimating \alpha by maximum
likelihood:

        KL(\hat{p} \,\|\, q(\alpha)) = \sum_{i=1}^{n} \hat{p}_i \ln \frac{\hat{p}_i}{q_i(\alpha)} \to \min_{\alpha}
        \;\Leftrightarrow\; \sum_{i=1}^{n} \hat{p}_i \ln q_i(\alpha) \to \max_{\alpha}.        (6)

Thus one can easily see that (2) is equivalent to weighted
Kullback-Leibler divergence minimization:

        \sum_{d \in D} n_d \, KL_w\!\left( \frac{n_{dw}}{n_d} \,\Big\|\, \sum_{t \in T} \phi_{wt}\,\theta_{td} \right) \to \min_{\Phi, \Theta},        (7)

where n_{dw} is the number of occurrences of word w in document d
and n_d is the total number of words in document d.

1.3.2   Matrix decomposition

Denote the empirical distribution of words in document d as
\hat{p}(w, d) = n_{dw} / n_d. With this notation, problem (2) can
be viewed as a matrix decomposition:

        F \approx_{KL} \Phi \Theta,        (8)

where the matrix F = (\hat{p}(w, d))_{W \times D} is the empirical
distribution of words over documents, \Phi = (\phi_{wt})_{W \times T}
is the distribution of words over topics, and
\Theta = (\theta_{td})_{T \times D} is the distribution of topics
over documents. Thus, our optimization problem may be rewritten in
Kullback-Leibler notation as

        KL(F, \Phi\Theta) \to \min.        (9)

Hence PLSA may be viewed as a stochastic matrix decomposition.
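As an illustration of objective (7)/(9), the following minimal
numpy sketch evaluates the weighted Kullback-Leibler divergence
between the empirical word-document matrix F and the model ΦΘ
(variable names are illustrative; this is not the authors' code):

```python
import numpy as np

def weighted_kl(n, phi, theta, eps=1e-12):
    """Weighted KL divergence (7) between the empirical distributions and the model.

    n     : (W, D) matrix of word counts, n[w, d] = n_dw
    phi   : (W, T) matrix, phi[w, t] = p(w | t)
    theta : (T, D) matrix, theta[t, d] = p(t | d)
    """
    n_d = n.sum(axis=0, keepdims=True)   # words per document
    f = n / n_d                          # empirical p(w | d), the matrix F
    model = phi @ theta                  # model p(w | d)
    # sum_d n_d * KL(F_d || (Phi Theta)_d); zero counts contribute nothing
    return float(np.sum(n * (np.log(f + eps) - np.log(model + eps))))
```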
1.4    Expectation-Maximization algorithm

Unfortunately, (2) has no analytical solution, so we use the
Expectation-Maximization (EM) algorithm. It consists of two steps:

 1. Estimate the number n_{dwt} of occurrences of word w produced
    by topic t in document d (E-step).

 2. Re-estimate the distribution of topics for each document and
    the distribution of words for each topic, relying on the
    n_{dwt} values obtained during the E-step (M-step).

One can estimate n_{dwt} as follows:

        n_{dwt} = n_{dw} \frac{p(w|t)\, p(t|d)}{\sum_{t} p(w|t)\, p(t|d)},        (10)

where n_{dw} is the number of occurrences of word w in document d.
The probability p(w|t) may then be estimated as

        p(w|t) = \frac{n_{wt}}{n_t} = \frac{\sum_{d} n_{dwt}}{\sum_{w} \sum_{d} n_{dwt}},        (11)

where n_{wt} is the number of occurrences of word w produced by
topic t,

        n_{wt} = \sum_{d \in D} n_{dwt},        (12)

and n_t is the total number of words produced by topic t,

        n_t = \sum_{w \in V} n_{wt}.        (13)

Similarly for p(t|d):

        p(t|d) = \frac{n_{td}}{n_d},        (14)

where n_d is the number of words in document d and n_{td} is the
estimated number of words in document d produced by topic t,

        n_{td} = \sum_{w \in V} n_{dwt}.        (15)
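The E-step (10) and the M-step (11)-(15) can be written compactly
with matrix operations. Below is a minimal numpy sketch of one EM
iteration; it follows the formulas above but is not the authors'
implementation, and the variable names are assumptions:

```python
import numpy as np

def em_iteration(n, phi, theta, eps=1e-12):
    """One PLSA EM iteration following (10)-(15).

    n     : (W, D) word-document count matrix, n[w, d] = n_dw
    phi   : (W, T) current estimate of p(w | t)
    theta : (T, D) current estimate of p(t | d)
    """
    # E-step denominator of (10): Z[w, d] = sum_t phi[w, t] * theta[t, d]
    Z = phi @ theta + eps
    R = n / Z                                  # n_dw / Z_wd
    # M-step: n_wt = sum_d n_dwt and n_td = sum_w n_dwt as matrix products
    new_phi = phi * (R @ theta.T)              # unnormalized n_wt
    new_theta = theta * (phi.T @ R)            # unnormalized n_td
    new_phi /= new_phi.sum(axis=0, keepdims=True)      # (11): columns give p(w | t)
    new_theta /= new_theta.sum(axis=0, keepdims=True)  # (14): columns give p(t | d)
    return new_phi, new_theta
```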

   As one can see, the asymptotic running time of this algorithm
is O(D × V × T × I), where D is the number of documents, V is the
average number of distinct words in a document, T is the number of
topics and I is the number of iterations until convergence.
Inference of PLSA on a large dataset therefore requires a lot of
time, so methods for decreasing the computation time are
important.
   The number of topics and the number of documents are defined by
the application. The size of the vocabulary (the number of
distinct words) can be decreased by text normalization (removal of
stop-words, lowercasing, etc.). The number of iterations until
convergence depends on the initial approximation of the PLSA
parameters Φ and Θ, so a good initial approximation can reduce it.
The current study presents an efficient approach to finding a
beneficial initial approximation. The other method of computation
time reduction is based on the idea that the matrix Φ may be
obtained from a small representative part of the document
collection.

2     Related Work

The original algorithm was described in 1999 in [4]. Since then
numerous papers have been devoted to PLSA, but only a few of them
address time efficiency. In [5] the authors improve time
efficiency by parallelizing the algorithm with OpenMP and report a
six-fold speed-up on an 8-CPU machine. Work [6] improves this
result by using MPI. But both of these studies attack the problem
of time efficiency purely by programming methods.
   In [7] Farahat uses LSA to find an initialization for PLSA. LSA
is based on the SVD (singular value decomposition: a factorization
of a matrix into the product of a unitary matrix, a diagonal
matrix, and another unitary matrix) in the L2 norm and lacks a
probabilistic interpretation. PLSA performs a stochastic matrix
decomposition based on the Kullback-Leibler divergence and has a
simple probabilistic interpretation, but it inherits the problem
of every non-convex optimization algorithm: it may converge to a
local minimum instead of the global one. The combination of LSA
and PLSA leverages the best features of both models: using the LSA
training result as an initial approximation helps to avoid
convergence to a poor local minimum. The problem of time
efficiency, however, is not explored there. Since [7] shows that
the L2 norm is appropriate for finding an initialization for the
PLSA inference algorithm, we will use this result in our work.
   The idea of obtaining a distribution over topics for a document
not included in the collection that PLSA was initially trained on
was expressed in [8]. The author suggests performing this through
the EM scheme while holding the matrix Φ fixed. However, he
proposes this method only for query processing, not for speeding
up PLSA training.
3     Proposed approach

In this work we present two different approaches to reducing the
computation time. One is based on finding an initial approximation
and thereby reducing the number of iterations until convergence.
The other is based on obtaining Φ_part on a representative sample,
fixing Φ_part and then obtaining Θ on the whole collection.

3.1    Finding an initial approximation

In this work we use neither LSA nor clustering methods. Instead we
take a subset of the collection (for example 10%), apply PLSA to
this sample and calculate an initial approximation from the
obtained matrix Φ. The computation time of PLSA is proportional to
the number of documents in the collection, so training PLSA on a
10% sample is at least ten times faster (per iteration) than
training on the whole collection.

3.1.1   Taking a sample

In order to obtain a representative sample we take a random one.
The exact size of the sample is not important, so we use a rather
simple scheme: we include documents independently with probability
10%. In this way we obtain a representative part of the collection
whose size is approximately 10% of the whole collection.
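A minimal sketch of this sampling scheme (illustrative only;
`documents` may be any sequence of documents):

```python
import random

def take_sample(documents, p=0.1, seed=0):
    """Include each document independently with probability p."""
    rnd = random.Random(seed)
    return [d for d in documents if rnd.random() < p]
```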
3.1.2   Initial approximation of Φ (words-topics)

For training PLSA on the sample we use a random initialization.
The computation time is linear in the number of documents, so
training on the sample is relatively fast. The obtained matrix
Φ_part can be used as an initial approximation of the matrix Φ for
the whole collection. An issue with this approach is that some
words from the vocabulary do not occur in the sample, so every
topic in Φ_part assigns zero weight to these words. If we used
Φ_part as is, these probabilities would stay zero at every step
(10), (2). This would have a disastrous effect on the likelihood
(or perplexity) of the model:

        likelihood = \prod_{w \in d} \sum_{t \in T} p(w|t)\,\theta_{td} \equiv 0,        (16)

because some word w* would have zero probability in every topic.
Thus some kind of smoothing is necessary. In this work we use a
trivial one:

 1. Add a constant to every position in every topic,

        \forall t, w \quad p(w|t) \mathrel{+}= const,

    where in this work

        const = \frac{1}{vocabularySize}.

 2. Normalize:

        \forall t \quad \sum_{w} p(w|t) = 1.
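A minimal sketch of this smoothing step (names are illustrative;
`phi_part` is the W x T matrix obtained on the sample):

```python
import numpy as np

def smooth_phi(phi_part):
    """Add 1/vocabularySize to every entry and renormalize every topic (column)."""
    vocabulary_size = phi_part.shape[0]
    phi = phi_part + 1.0 / vocabulary_size
    return phi / phi.sum(axis=0, keepdims=True)
```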
3.1.3   Initial approximation of Θ (documents-topics)

During the previous step we found an initial approximation for the
matrix Φ (words by topics). Now we have to find an initial
approximation for the matrix Θ (documents by topics) given Φ. Each
of its columns Θ_{·d} can be found as a solution of the following
optimization problem:

        \Theta_{\cdot d} = \arg\max_{\theta} P(d|\theta) = \arg\max_{\theta} \prod_{w \in d} \sum_{t \in T} p(w|t)\,\theta_{td}        (17)

with the constraint

        \sum_{t \in T} \theta_{td} = 1.        (18)

But our aim is to decrease the training time, and solving this
problem exactly is not fast enough. We propose to find Θ_{·d} in
the L2 norm instead. The L2 solution would not be a solution in
our space with the Kullback-Leibler divergence, but we are not
looking for the exact solution, only for an appropriate
initialization.
   Assume that all words are replaced by their serial numbers in
the vocabulary, and let us consider topics and documents as
vectors in R^V, where V stands for the vocabulary size. The i-th
coordinate of a document-vector is the number of times the word
with number i occurs in the document:

        \vec{d}(i) = \#(\text{word } i \text{ occurs in document } d).

The i-th coordinate of a topic-vector is the probability of
generating word i from topic t:

        \vec{t}(i) = p(i|t).

Consider the vector space spanned by the topic-vectors; the topics
form a basis in this space. One can find an initial approximation
of the distribution of a document d over topics as the orthogonal
projection onto this space:

        \vec{d} = \vec{d}_{\perp} + \vec{d}_{\parallel},        (19)

where

        \forall \vec{t} \in \Theta \quad (\vec{d}_{\perp}, \vec{t}) = 0.        (20)

For computational efficiency we orthogonalize and normalize the
basis, \{\vec{t}\} \to \{\vec{t}'\}, so that \forall i \ne j \;
(t'_i, t'_j) = 0 and \forall i \; (t'_i, t'_i) = 1. This allows us
to find the projection faster and more simply:

        \vec{d}_{\parallel} = \sum_{i} \alpha_i \vec{t}'_i,        (21)

where \vec{t}'_i is the i-th vector of the orthonormal basis and

        (\vec{d}, \vec{t}'_i) = (\vec{d}_{\parallel}, \vec{t}'_i) = \Big(\vec{t}'_i, \sum_{i} \alpha_i \vec{t}'_i\Big) = \alpha_i,        (22)

with the scalar product defined as

        (\vec{x}, \vec{y}) = \sum_{i=1}^{V} x_i y_i.        (23)

Having obtained the expansion of the document-vector in the
orthonormal basis, one can return to the topic basis and
normalize:

        \sum_{t \in T} d_t = 1,        (24)

where d_t is the weight of topic t in document d.
   Due to the nature of the L2 norm, some topic weights may be too
small or even negative, so smoothing is necessary. We do it
analogously to the previous subsection:

 1. Replace negative weights by zero.

 2. Add a constant to every weight; in this paper we use

        const = \frac{1}{numberOfTopics}.

 3. Perform normalization:

        \forall d \quad \sum_{t \in T} d_t = 1.
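The paper does not spell out the orthogonalization procedure, so
the sketch below uses a QR decomposition of Φ and maps the
coefficients back to the topic basis through the triangular
factor; treat this as one possible reading of (19)-(24), with
illustrative names, rather than the authors' implementation:

```python
import numpy as np

def theta_init_for_document(d_vec, phi):
    """Initial topic weights of one document via L2 projection onto the topic span.

    d_vec : (W,) word-count vector of the document
    phi   : (W, T) topic matrix obtained on the sample (assumed full column rank)
    """
    T = phi.shape[1]
    q, r = np.linalg.qr(phi)             # orthonormal basis {t'} for the topic span
    alpha = q.T @ d_vec                  # coefficients alpha_i from (22)
    weights = np.linalg.solve(r, alpha)  # back to the original topic basis
    weights = np.maximum(weights, 0.0)   # 1. replace negative weights by zero
    weights += 1.0 / T                   # 2. add const = 1 / numberOfTopics
    return weights / weights.sum()       # 3. normalize
```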
3.2    Fixed Φ approach

This approach is based on the fact that the matrix Φ may be found
by training PLSA on a representative subset D′ ⊂ D. The columns of
the matrix Θ may then be obtained through the following
constrained optimization problem, independently for each document
d ∈ D:

        L(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\phi}_{wt}\,\theta_{td} \to \max_{\theta_{\cdot d}},        (25)

        \sum_{t \in T} \theta_{td} = 1,        (26)

        \theta_{td} \ge 0.        (27)

The approach consists of the following steps:

 1. Take a random subset of documents D′ ⊂ D of sufficient size s.

 2. Obtain Φ̂ by training PLSA on D′ with the EM algorithm
    described in [4].

 3. Obtain θ_{·d} by solving the optimization problem (25), (26),
    (27) for each document d ∈ D.

The third step needs some explanation. We solve this problem with
an EM algorithm.
   E-step: estimate the probabilities p(t|d, w) as

        p(t|d, w) = \frac{\hat{\phi}_{wt}\,\theta_{td}}{\sum_{\tau \in T} \hat{\phi}_{w\tau}\,\theta_{\tau d}}.        (28)

   M-step: in order to solve the problem (25), (26), (27) we
temporarily omit the non-negativity constraint (27); we will see
that the solution is non-negative anyway. The Lagrange function
for problem (25), (26) takes the form

        L(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \hat{\phi}_{wt}\,\theta_{td} - \lambda \Big( \sum_{t \in T} \theta_{td} - 1 \Big).        (29)

Taking the derivative,

        \frac{\partial L}{\partial \theta_{td}}(\theta_{\cdot d}) = \sum_{w \in W} n_{dw} \frac{\hat{\phi}_{wt}}{\sum_{\tau \in T} \hat{\phi}_{w\tau}\,\theta_{\tau d}} - \lambda = 0.        (30)

Moving λ to the right-hand side, multiplying both sides by
θ_{td} and using (28), we get

        \sum_{w \in W} n_{dw}\, p(t|d, w) = \lambda\,\theta_{td}.        (31)

Summing both sides over all t ∈ T,

        \sum_{w \in W} \sum_{t \in T} n_{dw}\, p(t|d, w) = \lambda.        (32)

From equation (31) we obtain θ_{td} and substitute λ from (32):

        \theta_{td} = \sum_{w \in W} n_{dw} \frac{p(t|d, w)}{\sum_{\omega \in W} \sum_{\tau \in T} n_{d\omega}\, p(\tau|d, \omega)}.

The denominator does not depend on t, thus

        \theta_{td} \propto \sum_{w \in W} n_{dw}\, p(t|d, w).        (33)

Primal feasibility is easily verified by summing θ_{td} over t,
and dual feasibility follows from (32) and the non-negativity of
the probabilities, so this point satisfies the Karush-Kuhn-Tucker
conditions. Also, one can easily see that θ_{td} ≥ 0; thus we have
found a solution of the problem (25), (26), (27).
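A minimal sketch of this fixed-Φ inference for a single document,
iterating (28) and (33) (illustrative names; not the authors'
code):

```python
import numpy as np

def infer_theta(n_d, phi_hat, n_iter=30, eps=1e-12):
    """Infer p(t | d) for one document while phi is held fixed.

    n_d     : (W,) word counts of the document
    phi_hat : (W, T) matrix p(w | t) trained on the subset D'
    """
    T = phi_hat.shape[1]
    theta = np.full(T, 1.0 / T)                    # uniform start
    for _ in range(n_iter):
        # E-step (28): p(t | d, w) for every word of the vocabulary, shape (W, T)
        p = phi_hat * theta
        p /= p.sum(axis=1, keepdims=True) + eps
        # M-step (33): theta_td proportional to sum_w n_dw * p(t | d, w)
        theta = n_d @ p
        theta /= theta.sum()
    return theta
```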

4     Experiments

We conduct two kinds of experiments: we evaluate perplexity [9]
for our approaches and for classical PLSA, and we compare
classification performance on the topic distributions obtained by
PLSA and by our approaches.

4.1    Datasets

Both kinds of experiments were conducted on three datasets:
tweets, news articles and abstracts of scientific papers.

    • Twitter dataset. The Twitter dataset contains tweets posted
      by 15000 Twitter users, written in English. We merge all
      tweets posted by a single user into a single document;
      every document contains approximately 1000 tweets.
      Documents with fewer than 50 words are omitted.

    • The 20 Newsgroups dataset
      (http://qwone.com/~jason/20Newsgroups/). The 20 Newsgroups
      dataset is often used for testing topic models on text
      categorization. It is a collection of approximately 20,000
      short newsgroup articles, partitioned (nearly) evenly
      across 20 different newsgroups.

    • arXiv (http://arxiv.org/). The third dataset consists of
      abstracts of scientific articles: approximately 900,000
      abstracts from 6 areas (Physics, Mathematics, Computer
      Science, Quantitative Biology, Quantitative Finance,
      Statistics). The distribution of articles over areas is not
      uniform: Physics contains 600 thousand abstracts,
      Mathematics 270 thousand, and Quantitative Biology only 5
      thousand. Abstracts with fewer than 20 words are omitted.
      For the experiments with fixed Φ we omit the small areas
      and take only Physics, Mathematics and Computer Science.

Some text normalization is performed: stoplisting, lowercasing and
removal of rare words (those that occur fewer than 5 times in the
whole collection).
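A minimal sketch of this normalization step (illustrative;
`stop_words` is assumed to be a given set of stop-words):

```python
from collections import Counter

def normalize(docs, stop_words, min_count=5):
    """Lowercase, drop stop-words and drop words occurring fewer than min_count times."""
    tokenized = [[w.lower() for w in doc.split() if w.lower() not in stop_words]
                 for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
```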
4.2    Initial approximation

Four types of initial approximation are compared:

    • Random initial approximation for both the matrix Φ (words
      by topics) and the matrix Θ (documents by topics). Denoted
      "randomly".

    • An initial approximation for the matrix Φ calculated on a
      sample, with a random initial approximation for the matrix
      Θ. Denoted "phi".

    • An initial approximation for the matrix Θ calculated on a
      sample, with a random initial approximation for the matrix
      Φ. Denoted "theta".

    • Initial approximations for both the matrix Θ and the matrix
      Φ calculated on a sample. Denoted "full".

4.2.1   Perplexity depending on the initial approximation

We evaluate the dependence of perplexity on the type of initial
approximation. In these experiments the number of iterations is
fixed at 100, and the number of topics is 25 for every experiment.
   Figures 1, 2 and 3 show perplexity depending on the number of
iterations for the different datasets and the different types of
initial approximation; the keys correspond to the types listed at
the beginning of Section 4.2.

        [Figure 1: Perplexity for the Twitter dataset]

        [Figure 2: Perplexity for news articles]

        [Figure 3: Perplexity for scientific articles]

   As one can see, all the types of initial approximation decrease
the perplexity of the model. A model with an initial approximation
found by our approach converges faster than a model with a random
initial approximation, and the same behavior is observed for every
dataset. The perplexity values reached after 100 iterations for
the different datasets and initialization types are presented in
Table 1.

               Table 1: Perplexity after 100 iterations
        initialization    news articles    tweets     arxiv
        random            3006.86          9287.68    1575.58
        Θ only            2913.73          9111.91    1551.84
        Φ only            2993.90          9143.94    1559.13
        Θ and Φ           2973.90          8816.67    1561.07

4.2.2   Computation time depending on the initial approximation

In these experiments we evaluate the training time depending on
the initial approximation. We perform iterations until a stop
criterion is satisfied: the change of perplexity is less than 1
five times in a row. Choosing a threshold is not the aim of this
work; similar results were observed for a wide range of
thresholds, and the difference in dispersion is less than the
standard deviation. Results for the different datasets and
initialization types are presented in Tables 2, 3 and 4. (They
include the whole training time: training on the sample,
orthogonalization, finding the initial approximation for Θ, and
training on the whole collection.)
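A minimal sketch of this stop criterion (illustrative; `step`
performs one EM iteration and `perplexity` evaluates the current
model):

```python
def train_until_converged(step, perplexity, threshold=1.0, patience=5, max_iter=1000):
    """Iterate until perplexity changes by less than `threshold` `patience` times in a row."""
    prev = perplexity()
    streak = 0
    for it in range(max_iter):
        step()                         # one EM iteration
        cur = perplexity()
        streak = streak + 1 if abs(prev - cur) < threshold else 0
        if streak >= patience:
            return it + 1              # number of iterations performed
        prev = cur
    return max_iter
```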

   Table 2: Computation time depending on the initial
   approximation, Twitter dataset
        initialization    perplexity    time, sec    #iterations
        random            9258.1        7164.9       114
        Θ only            9143.5        4131.9       57
        Φ only            9213.2        4814.9       67
        Θ and Φ           8856.5        3722.6       48

   It can be seen that our approach decreases the calculation time
1.5-2 times on every dataset. The perplexity of the models with an
initial approximation obtained by our approach is less than or
equal to the perplexity of the model with a random initial
approximation.
   Table 3: Computation time depending on the initial
   approximation, news articles
        initialization    perplexity    time, sec    #iterations
        random            2996.1        161.5        110
        Θ                 2946.9        102.3        64
        Φ                 3020.6        97.5         61
        Θ and Φ           2999.4        106.6        67

   Table 4: Computation time depending on the initial
   approximation, arXiv
        initialization    perplexity    time, sec    #iterations
        random            1592.8        4017.8       59
        Θ                 1560.9        2413.7       30
        Φ                 1587.7        2012.9       25
        Θ and Φ           1583.7        1893.3       21

4.3    Fixed Φ approach

4.3.1   Perplexity

Inspecting perplexity is a common way to compare different topic
models [9], [10], [11]. We compare our model, with varying size of
the training subset, against PLSA on the different datasets. The
results are shown in Figures 4, 5 and 6.

        [Figure 4: Perplexity depending on the training subset
        size, arXiv dataset]

        [Figure 5: Perplexity depending on the training subset
        size, Twitter dataset]

        [Figure 6: Perplexity depending on the training subset
        size, news articles dataset]

   As one can see, the perplexity values for PLSA and for our
approximation are nearly equal, especially for such large
collections as the arXiv and Twitter datasets. It is worth
mentioning that all the perplexity curves have a "horizontal
tail": the Φ matrix is not inferred better on a larger sample.
This means that the size of the sample needed to infer the Φ
matrix does not depend on the size of the dataset, which is
especially significant for training on huge datasets.

4.3.2   Text categorization

Another way to compare topic models is an application task, for
example document categorization. In these experiments we classify
news articles by category and Twitter users by gender, treating
topic distributions as features. We obtain topic distributions by
training PLSA on the whole collection and by our approximation
with topics trained on a varying fraction of the whole collection.
Then we estimate the quality of classification by 10-fold
cross-validation. In both experiments we use 20 topics. For
classification we use the random forest classifier from the
scikit-learn package
(http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
The results can be found in Figures 7 and 8. The majority to
minority class ratio for the Twitter dataset is 1.17.

        [Figure 7: Classification of Twitter users by gender]

        [Figure 8: Classification of news articles by category]

   As one can see, our approximation exhibits comparable results
if the training dataset is not too small, and it works noticeably
better for large collections. Another important observation is
that in all the experiments the curve reaches a plateau, which
confirms that the matrix Φ may be found by training PLSA on a
representative subset.
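A minimal sketch of this evaluation with the current scikit-learn
API (the paper does not give the exact settings; `theta` is the
documents-by-topics feature matrix and `labels` the class labels,
both illustrative names):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def categorization_score(theta, labels):
    """10-fold cross-validated accuracy of a random forest on topic features."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, theta, labels, cv=10).mean()
```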
4.3.3   Time efficiency

One of the aims of our work is improving time efficiency. We
compare the training time of PLSA and of our approximation with
|D′| = 1/4 |D|. Table 5 presents the calculation time for PLSA and
for our approximation (the time needed to train PLSA on the subset
is included). Another important characteristic is the average time
needed to process one document with our approximation and with
PLSA.
                Table 5: Time to process the collection
        Dataset          PLSA, sec    approx., sec    speed-up, times
        arXiv            4333         1972            2.2
        Twitter          6384         2702            2.4
        News articles    164          64              2.6

The obtained per-document values, together with the speed-up in
comparison to the average time needed to process a single document
with the original PLSA, are given in Table 6 (the time needed to
train PLSA on the subset is not included).

           Table 6: Average time to process one document
        Dataset          PLSA, ms     approx., ms     speed-up, times
        arXiv            14.3         3.0             4.8
        Twitter          414.2        70.7            5.9
        News articles    9.0          0.7             12.9

5     Conclusion

We have developed two methods for reducing the computation time of
PLSA: one is based on finding an appropriate initial
approximation, and the other is based on fixing the Φ matrix. We
have tested both methods on three different datasets. The method
based on finding an initial approximation demonstrates the same
behavior on every dataset used: the calculation time and the
number of iterations until convergence are decreased, while the
quality of the topic model does not degrade. We confirm that the
transition from the Kullback-Leibler divergence to the L2 norm is
appropriate for finding an initial approximation for PLSA.
   The method based on fixing the Φ matrix demonstrates a more
significant speed-up, but precision drops. However, the drop in
precision is not significant, especially on large datasets.

References

 [1] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark
     Steyvers. Statistical topic models for multi-label document
     classification. Mach. Learn., 88(1-2):157–208, July 2012.

 [2] Cailing Dong and Bin Zhou. Effectively detecting content spam
     on the web using topical diversity measures. Web Intelligence
     and Intelligent Agent Technology, IEEE/WIC/ACM International
     Conference on, 1:266–273, 2012.

 [3] Venkatesh Ramanathan and Harry Wechsler. PhishGILLNET:
     phishing detection methodology using probabilistic latent
     semantic analysis, AdaBoost, and co-training. EURASIP Journal
     on Information Security, 2012(1):1, 2012.

 [4] Thomas Hofmann. Probabilistic latent semantic indexing. In
     Proceedings of the 22nd Annual International ACM SIGIR
     Conference on Research and Development in Information
     Retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999.
     ACM.

 [5] Chuntao Hong, Yurong Chen, Weimin Zheng, Jiulong Shan, Yurong
     Chen, and Yimin Zhang. Parallelization and characterization
     of probabilistic latent semantic analysis. In Parallel
     Processing, 2008. ICPP ’08. 37th International Conference on,
     pages 628–635, 2008.

 [6] Raymond Wan, Vo Ngoc Anh, and Hiroshi Mamitsuka. Efficient
     probabilistic latent semantic analysis through
     parallelization. In Gary Geunbae Lee, Dawei Song, Chin-Yew
     Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and
     Tetsuya Sakai, editors, Information Retrieval Technology,
     volume 5839 of Lecture Notes in Computer Science, pages
     432–443. Springer Berlin Heidelberg, 2009.

 [7] Ayman Farahat. Improving probabilistic latent semantic
     analysis using principal component analysis. In Eleventh
     Conference of the European Chapter of the Association for
     Computational Linguistics (EACL), 2006.

 [8] Thomas Hofmann. Probabilistic latent semantic indexing. In
     Proceedings of the 22nd Annual International ACM SIGIR
     Conference on Research and Development in Information
     Retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999.
     ACM.
 [9] Anna Potapenko and Konstantin Vorontsov. Robust PLSA performs
     better than LDA. In ECIR, pages 784–787, 2013.

[10] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent
     Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March
     2003.

[11] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and
     David Mimno. Evaluation methods for topic models. In
     Proceedings of the 26th Annual International Conference on
     Machine Learning, ICML ’09, pages 1105–1112, New York, NY,
     USA, 2009. ACM.