<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the Tenth Spring Researcher's Colloquium on Databases and Information Systems</institution>
          ,
          <addr-line>Veliky Novgorod, Russia, 2014</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Institute for System Programming (ISP) of the Russian Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Probabilistic Latent Semantic Analysis (PLSA) is an effective technique for information retrieval, but it has a serious drawback: it consumes a huge amount of computational resources, so it is hard to train this model on a large collection of documents. The aim of this paper is to improve time efficiency of the training algorithm. Two different approaches are explored: one is based on efficient finding of an appropriate initial approximation, the idea of another is that for the most of collection topics may be extracted from relatively small fraction of the data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Topic modeling is an application of machine learning
to text analysis. Topic modeling is useful for different
text analysis tasks, for example: document
categorization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], spam detection [2], phishing detection [3] and
many other applications.
      </p>
      <p>One of the widespread algorithms is Probabilistic Latent
Semantic Analysis (PLSA), suggested by Thomas
Hofmann in [4].</p>
      <sec id="sec-1-1">
        <title>Generative model</title>
        <p>PLSA is based on generative model ”bag of words”:
every document is assumed to be a multinomial
distribution over topics. Every topic is a multinomial distribution
over words. Generation model may be defined as follow:
• For every position in document d i.i.d choose topic
t from distribution of topics by document
• Choose word w from topic t
The aim of topic modeling is to recover topics and
distribution of document by topics.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Topic modeling as optimization problem</title>
        <p>According to generative model one can estimate
probability to observe collection D as:
p(D) = Y Y X p(tjd)p(wjt)
(1)
d2D w2d t
Denote 'wt = p(wjt) and td = p(tjd). One may obtain
'wt and td as solution of optimization problem
with boundary
and
1.3
1.3.1</p>
        <p>L = X
d2D w2d</p>
        <p>X log X 'wt td ! max</p>
        <p>t
8t</p>
        <p>X 'wt = 1; 8d X td = 1
w w
8t; w 'wt
0; 8d; t
wt
0</p>
      </sec>
      <sec id="sec-1-3">
        <title>Topic modeling as matrix decomposition</title>
      </sec>
      <sec id="sec-1-4">
        <title>Kullback-Leibler divergence</title>
        <p>Kullback-Leibler divergence is a non-negative measure
of diffrence between two different probability
distribution:</p>
        <p>KL(pijjqi) =
n
X pi ln
i=1
pi
qi
Consider an empirical distribution p^i and some
parametric distribution qi = qi( ) which is used to explain p^i
. Easy to see that in this case minimization of KL–
divergence is equivalent to estimation of by
maximumlikelihood:
(2)
(3)
(4)
(5)
KL(pijjqi( )) =
! min
(6)
n
X pi ln
i=1</p>
        <p>pi
qi( )
n
, X pi ln(qi( )) ! max</p>
        <p>i=1</p>
        <p>Thus one can easily see that (2) equivalent to weighted
Kullback-Leibler divergence minimization:</p>
        <p>X ndKLw
d2D
ndw
nd jj X 'wt td
t2T
!
! min
;
(7)
where nwd– number of words w in document d, nd –
number of words in document d.
1.3.2</p>
      </sec>
      <sec id="sec-1-5">
        <title>Matrix decomposition</title>
        <p>Denote empirical distribution of words by document as
p^(w; d) = nwd . According to this notation one can
connd
sider the problem (2) as matrix decomposition:
where matrix F = (p^(w; d))W D is empirical
distribution of words by document, matrix = ('wt)W D
is distribution of words by topics and matrix =
( td)T D is distribution of documents by topics. Thus,
our optimization problem may be rewritten in
KullbackLeibler notation as</p>
        <p>KL(F;
) ! min
(9)
Thus PLSA may be observed as stochastic matrix
decomposition.
1.4</p>
      </sec>
      <sec id="sec-1-6">
        <title>Expectation-Maximization algorithm</title>
        <p>Unfortunately (2) has no analytical solution. Thus we
use Expectation - Maximization (EM) algorithm. This
algorithm consists of two steps:
1. Estimation of number ndwt of words w, produced
by topic t in document d. (E - step)
2. Optimization of distribution of documents by topics
and optimization of distribution of topics by words
relying on the ndwt values obtained during E - step
. (M - step)
One can estimate ndwt as follows:
ndwt =
nwdp(wjt)p(tjd)</p>
        <p>Pt p(wjt)p(tjd)
where nwd – number of words w in document d. Thus,
probability p(wjt) may be estimated as
p(wjt) = nwt =
nt</p>
        <p>Pd ndwt
Pw Pd ndwt
Where nwt – number of words w, produced by topic t:
nwt =</p>
        <p>ndwt
X
d2D
nt = X</p>
        <p>nwt
w2V
p(tjd) =
ntd
nd
ntd =</p>
        <p>ndwt
X
w2V
nt – number of words, produced by topic t
Similarly for p(tjd):
Where nd – number of words in document d, ntd –
estimated number of words in document d, produced by
topic t:</p>
        <p>As one can see, the asymptotic time of this algorithm
is O(D V T I) where D – the number of documents,
V – average number of distinct words in document, T
– number of topics and I – number of iterations until
convergence. Inference of the PLSA on a large dataset
requires a lot of time thus the methods of decreasing of
computation time are important.</p>
        <p>Number of topics and number of documents are defined
by application. Size of vocabulary (number of distinct
(10)
(11)
(12)
(13)
(14)
(15)
words) can be decreased by text normalization
(removing of stop-words, lowercasing, etc). Number of
iterations until convergence depends of initial approximation
of PLSA parameters and , so a good initial
approximation can reduce the number of iterations until
convergence. The current study presents an efficient
approach to find a beneficial initial approximation. The
other method of computation time reduction based on
idea that matrix may be obtained on small
representative part of documents collection.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The original algorithm was described in 1999 in [4].
Since 1999 numerous papers were devoted to PLSA,
but only a few of them are devoted to time efficiency
improvement. In [
        <xref ref-type="bibr" rid="ref2">5</xref>
        ] authors improve time efficiency by
parallelization of the algorithm using OpenMP. Authors
report 6 times speed-up on 8 CPU machine. Work [
        <xref ref-type="bibr" rid="ref3">6</xref>
        ]
improves the result of a previous work by using MPI.
But both of these studies try to solve the problem of time
efficiency purely by programming methods.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">7</xref>
        ] Farahat uses LSA for finding an initialization
for PLSA. LSA is based on SVD 1 matrix decomposition
in L2 norm and lacks probabilistic interpretation. PLSA
performs stochastic matrix decomposition based on
Kullback-Leibler divergence and has a simple
probability interpretation, but it inherits the problem of every
non-convex optimization algorithm: it may converge to
a local minimum instead of the global one. Combination
of LSA and PLSA leverages the best features of these
models: usage of LSA training result as an initial
approximation helps to avoid convergence to a poor
local minimum. But the problem of time efficiency is
not explored. In [
        <xref ref-type="bibr" rid="ref4">7</xref>
        ] it is shown that L2 norm usage is
appropriate to find an initialization for PLSA inference
algorithm, we will use this result in our work.
An idea of obtaining a distribution over topics for a
document not included in a collection that PLSA was
initially trained on was expressed in [
        <xref ref-type="bibr" rid="ref5">8</xref>
        ]. The author
suggests to perform this through EM-scheme holding
matrix fixed. However, he proposes this method only
for query processing but not for PLSA training speed-up.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed approach</title>
      <p>In this work we present two different approach for
computational time reduction. One is based on finding initial
approximation and reduction of number of iteration to
convergence. The other is based on obtaining part on
representative sample, fixing part and then obtaining
on whole collection.
3.1</p>
      <sec id="sec-3-1">
        <title>Finding initial approximation</title>
        <p>In this work we do not use LSA nor clustering methods.
Instead we take a subset of our collection (for example
10%), apply PLSA to this sample and calculate an
initial approximation using obtained matrix .
Computa1SVD – Singular value decomposition, a factorization of a matrix
into the product of a unitary matrix, a diagonal matrix, and another
unitary matrix
tion time of PLSA is proportional to the number of
documents in collection and training PLSA on 10% part of
collection is at least ten times faster than on the whole
collection (per iteration).
3.1.1</p>
      </sec>
      <sec id="sec-3-2">
        <title>Taking a sample</title>
        <p>In order to take a representative sample we need to take
a random sample. The exact size of sample is not
important so we use a rather simple scheme: we include
documents independently with probability 10%. So we
take a representative part of collection and its size is
approximately 10% of size of the whole collection.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2 Initial approximation of (words-topics)</title>
        <p>For training PLSA on the sample we use a random
initialization. Computation time is linear by the number of
the documents in the collection, so the process of training
turns out to be relatively fast. The obtained matrix part
can be used as initial approximation of matrix for the
whole collection. An issue of this approach is that some
words from the vocabulary do not occur in the sample
and every topic in matrix part has zero weight for these
words. If we would use matrix part as is, these
probabilities would stay zero on every step (10), (2). It would
have disastrous result for likelihood (or perplexity) of our
model:
likelihood = Y</p>
        <p>X p(wjt) t
0</p>
        <p>(16)
because some word w would have zero probability for
every topic. Thus some kind of smoothing is necessary.
In this work we use a trivial one:
1. add some constant for every position in every topic</p>
        <p>In this work we use
2. normalize:</p>
        <p>8t; w p(wjt) += const
const =</p>
        <p>1
vocabularySize
8t X p(wjt) = 1</p>
        <p>w</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.1.3 Initial approximation of (document-topics)</title>
        <p>During the previous step we found an initial
approximation for matrix (words by topic). Now we have to find
an initial approximation for matrix (documents-topics)
given . Its every column d: can be found as a solution
of the following optimization problem:
d: = arg max P (dj ) = arg max Y</p>
        <p>X p(wjt) wt
But our aim is to decrease the training time and
computation method of solving this problem is not fast enough.
We propose to find d: in the norm L2. The solution in
L2 would not be a solution in our space with
Kullback-Leibler divergence, but we are not looking for the exact
solution, but for an appropriate initialization.</p>
        <p>Assume that all words are replaced by their serial
numbers in the vocabulary. Let us consider topics and
documents as vector in RV , where V stands for vocabulary
size. i th coordinate in vector-document represents the
number of times a word with the number i occurs in the
document:</p>
        <p>d~(i) = #(word i occur in document d)
For topic-vector i th coordinate shows probability to
generate word i from this topic t:</p>
        <p>~t(i) = p(ijt)
Consider vector space, formed by topic-vectors. Topics
form basis in this space. One can find an initial
approximation for distribution of documents d by topics as
orthogonal projection to this space:
Where
For computation efficiency we perform
orthogonalization and normalization of basis f~tg ! f~t0g, where
8i 6= j(t0i; t0j ) = 0 and 8i(t0i; t0i) = 1. It allows us to
find the projection faster and simpler:</p>
        <p>d~ = d~? + d~k
~
8t 2</p>
        <p>(d~?; ~t) = 0
d~k = X
~
it0i
i
where ~t0i – i-th vector of the orthonormal basis
(d~; ~t0i) = (d~k; ~t0i) = (~t0i; X
i~t0i) =
i
Where scalar product is defined as follows
(~x; ~y) =</p>
        <p>i
V
X xi yi
i=1
Document-vector expansion in the orthonormal basis is
obtained, so one can return to topics basis and perform
normalization as follows:</p>
        <p>X dt = 1
t2
Where dt – weight of topic t in document d</p>
        <p>Due to the nature of L2 norm some topic weight may
be too small or even negative, so smoothing is necessary.
We do it analogously to the previous subsection:
1. Replace negative weight by zero
2. add some constant for every weight. In this paper
we use
const =</p>
        <p>1
numberOf T opics
3. perform normalization
8d X dt = 1
(19)
(20)
(21)
(22)
(23)
(24)
This approach is based on the fact that matrix may be
found by training PLSA on representative subset D0
D. The rows of the matrix may be obtained through
the following constrained optimization problem
independently for each document d 2 D:</p>
        <p>w2W
L( d) = X ndw ln X ^wt td ! max
d
t2T
X td = 1
t2T
td</p>
        <p>0
This approach consists of the following steps:
1. Take a random subset of documents D0 2 D of
sufficient size s
2. Obtain ^ through training PLSA on D0 using EM
algorithm described in [4]
3. Obtain d through solving optimization problem
(25), (26), (27) for each document d 2 D
The third step needs some explanation.</p>
        <p>We solve this problem with EM-algorithm.</p>
        <p>E-step estimate the probabilities p(tjd; w) as
p(tjd; w) = P
2T
wt td
w
d</p>
        <p>M-step In order to solve the problem (25), (26), (27)
temporary omit the non-negativeness constraint (27) –
we will see that the solution is non-negative.</p>
        <p>Lagrange function for problem (25), (26) takes the
form:
L( d) = X ndw ln X ^wt td
w2W
t2T
(X
t2T
td
1) (29)
Take a derivative:
w2W</p>
        <p>^wt
ndw P ^wt td
t2T
= 0
(30)
w2W
X
w2W t2T</p>
        <p>Move to the right part and multiply both sides by
td and according to (28)</p>
        <p>X ndwp(wjt; d) =
td
Now sum both sides by every t 2 T</p>
        <p>X ndwp(wjt; d) =</p>
        <p>From equation (31) obtain td and substitute
(32):
td =</p>
        <p>X
w2W
ndw P
p(wjt; d)</p>
        <p>P nd!p(!j ; d)
!2W 2T
(25)
(26)
(27)
(28)
(31)
(32)
from</p>
        <p>The denominator is independent of t, thus
td /</p>
        <p>X ndwp(wjt; d)
w2W
(33)</p>
        <p>Primal feasibility is easily verifiable by td
summation by t.</p>
        <p>Dual feasibility follows from (32) and probability
non-negativeness.</p>
        <p>So this point satisfies the Karush-Kuhn-Tucker
conditions.</p>
        <p>Also, one can easily see that td 0, thus we have
found a solution of the problem (25), (26), (27).
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        We conduct two kinds of experiments: we evaluate
perplexity [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ] for our approaches and for classical PLSA
and compare classification performance on topics
distributions, obtained by PLSA and by our approaches.
4.1
      </p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>Both experiments were conducted on three datasets:
tweets, news articles, abstracts of scientific papers.
• Twitter dataset</p>
        <p>Twitter dataset contains tweets, posted by 15000
Twitter users, written in English. We merge all
tweets, posted by single user into a single
document. Every document contains approximately
1000 tweets. Documents with less than 50 words
are omitted.
• The 20 Newsgroups data set 2</p>
        <p>The 20 Newsgroups dataset is often used for topic
modeling testing on text categorization. It contains
short news articles on one of twenty newsgroups.
It is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across
20 different newsgroups.
• arxive 3</p>
        <p>The third data set consists of abstracts of scientific
articles. It consists of approximately 900000
abstracts from 6 areas: Physics, Mathematics,
Computer Science, Quantitative Biology, Quantitative
Finance, Statistics. Distribution of articles by
areas is not uniform: Physics contains 600 thousands,
Mathematics contains 270 thousands and
Quantitative Biology only 5 thousands. Abstracts with less
than 20 words are omitted. For experiments with
fixed we omit small area and take only Physics,
Mathematics and Computer Science.</p>
        <p>Some text normalization is performed: stoplisting,
lowercasing, rare words removing (ones that occur less than
5 times in the whole collection).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Initial approximation</title>
        <p>Four types of initial approximation are compared:
2http://qwone.com/~jason/20Newsgroups/
3http://arxiv.org/
• Random initial approximation for matrix (words
by topic) and matrix (document by topic).
Denoted ”randomly”.
• Calculate an initial approximation for matrix on
sample. Use random initial approximation for
matrix . Denoted ”phi”.
• Calculate an initial approximation for matrix on
sample. Use random initial approximation for
matrix . Denoted ”theta”.
• Calculate an initial approximation for matrix
matrix on sample. Denoted ”full”.
and
4.2.1</p>
      </sec>
      <sec id="sec-4-3">
        <title>Perplexity depending on initial approximation</title>
        <p>We evaluate the dependence of perplexity on different
types of initial approximation. During these experiments
the number of iterations is fixed and equals 100. The
number of topics is 25 for every experiment.</p>
        <p>In figure 1, figure 2 and figure 3 one can see perplexity
depending on number of iterations for different datasets
and different types of initial approximation. The keys are
given above in the beginning of Section 4.2</p>
        <p>As one can see all the types of initial approximation
decrease perplexity of model. Model with initial, finding
by our approach converges faster then model with
random initial approximation. The same behavior is
observed for every data set. Perplexity values eventually
obtained in 100 iterations for different datasets and
different types of initialization are presented in table 1.
In these experiments we evaluate training time
depending on initial approximation. We perform iterations until
the stop criterion is satisfied: change of perplexity is less
than 1 five times in a row. Choosing a threshold is not
the aim of this work. Similar results were observed for a
wide range of thresholds. Difference in dispersion is less
than standard deviation. Results for different datasets
and different types of initialization are presented in
tables 2, 3 and 4. (It include all training time: training on
sample, orthogonalization, finding initial approximation
for , training on whole collection)</p>
        <p>
          It can be seen that our approach decreases calculation
time 1.5-2 times in every data set. Perplexity in
models with initial approximation, achieved by our approach
is less or equal to the perplexity of model with random
initial distribution.
and
4.3
4.3.1
Perplexity inspection is a common way to compare
different topic models [
          <xref ref-type="bibr" rid="ref6">9</xref>
          ] [
          <xref ref-type="bibr" rid="ref7">10</xref>
          ] [
          <xref ref-type="bibr" rid="ref8">11</xref>
          ]. We compare our model
with varying size of training subset with PLSA on
different datasets. One can see the results on Figures 4, 5 and
6 .
        </p>
        <p>As one can see perplexity values for PLSA and for
our approximation are nearly equal, especially for such a
large collections as arxiv or Twitter dataset. Worth
mentioning that all the perplexity curves have a ”horizontal
tail” – matrix is not inferred better on a larger sample.
It means that the size of a sample needed to infer
matrix does not depend on the size of a dataset. This fact is
especially significant for training on huge datasets.
4.3.2</p>
      </sec>
      <sec id="sec-4-4">
        <title>Text categorization</title>
        <p>The other way to compare different topic models is
application task, for example document categorization.
In these experiments we classify news articles by
categories and Twitter users by gender treating topic
distributions as features. We obtain topic distributions by
training PLSA on the whole collection and by our
approximation with topics, trained on varying fraction of
the whole collection. Then we estimate the quality of
classification by cross-validation with 10 folds. In both
experiments we use 20 topics. For classification we use
random forest classifier from package scikit-learn 4. The
results can be found on Figures 7 and 8. Majority to
minority class ratio for Twitter dataset is 1.17</p>
        <p>As one can see our approximation exhibits
comparable result if the training dataset is not too small. It works
noticeably better for a large collection. Another
important observation is that in all the experiments the curve
reaches a plateau, it confirms that matrix may be found
by training PLSA on a representative subset.
4.3.3</p>
      </sec>
      <sec id="sec-4-5">
        <title>Time efficiency</title>
        <p>One of the aims of our work is time efficiency
improvement. We compare training time for PLSA and for our
approximation with jD0j = 14 jDj. In Table 5 calculation
time for PLSA and for our approximation (time to train
PLSA on the subset is included) are presented.</p>
        <p>Another important characteristic is average time to
process one document with our approximation and
PLSA. Obtained values and speed-up in comparison to
4http://scikit-learn.org/stable/modules/generated/
sklearn.ensemble.RandomForestClassifier.html
time needed to train original PLSA on a single document
in average are given in Table 6 (time to train PLSA on
the subset is not included).
We develop two methods for computation time reduction
one is based on finding appropriate initial approximation
and other is based on fixing matrix and tested these
methods on three different datasets. Method, based on
finding initial approximation demonstrate the same
behavior on every used dataset, the calculation time and
number of iterations to converge is decreased, yet the
quality of topic model does not decrease. We confirm
that transition from Kullback-Leibler divergence to L2
norm is appropriate to find an initial approximation for
PLSA.</p>
        <p>Method, based on finding initial approximation
demonstrate more significant speed-up, but precision is drop.
However drop of precision is not significant, especially
on large datasets.
[2] Cailing Dong and Bin Zhou. Effectively
detecting content spam on the web using topical diversity
measures. Web Intelligence and Intelligent Agent
Technology, IEEE/WIC/ACM International
Conference on, 1:266–273, 2012.
[3] Venkatesh Ramanathan and Harry Wechsler.
phishgillnet–phishing detection methodology
using probabilistic latent semantic analysis, adaboost,
and co-training. EURASIP Journal on Information
Security, 2012(1):1, 2012.
[4] Thomas Hofmann. Probabilistic latent semantic
indexing. In Proceedings of the 22Nd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’99,
pages 50–57, New York, NY, USA, 1999. ACM.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Timothy</surname>
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Rubin</surname>
            , America Chambers, Padhraic Smyth, and
            <given-names>Mark</given-names>
          </string-name>
          <string-name>
            <surname>Steyvers</surname>
          </string-name>
          .
          <article-title>Statistical topic models for multi-label document classification</article-title>
          .
          <source>Mach</source>
          . Learn.,
          <volume>88</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>157</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>July 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Chuntao</given-names>
            <surname>Hong</surname>
          </string-name>
          , Yurong Chen, Weimin Zheng, Jiulong Shan, Yurong Chen, and Yimin Zhang.
          <article-title>Parallelization and characterization of probabilistic latent semantic analysis</article-title>
          .
          <source>In Parallel Processing</source>
          ,
          <year>2008</year>
          . ICPP '
          <volume>08</volume>
          . 37th International Conference on, pages
          <fpage>628</fpage>
          -
          <lpage>635</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Raymond</given-names>
            <surname>Wan</surname>
          </string-name>
          , VoNgoc Anh, and
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Mamitsuka</surname>
          </string-name>
          .
          <article-title>Efficient probabilistic latent semantic analysis through parallelization</article-title>
          .
          <source>In GaryGeunbae Lee</source>
          , Dawei Song,
          <string-name>
            <surname>Chin-Yew</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and Tetsuya Sakai, editors,
          <source>Information Retrieval Technology</source>
          , volume
          <volume>5839</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>432</fpage>
          -
          <lpage>443</lpage>
          . Springer Berlin Heidelberg,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ayman</given-names>
            <surname>Farahat</surname>
          </string-name>
          . F.r.:
          <article-title>Improving probabilistic latent semantic analysis using principal component analysis</article-title>
          .
          <source>In In: Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <article-title>Probabilistic latent semantic indexing</article-title>
          .
          <source>In Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Anna</given-names>
            <surname>Potapenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Vorontsov</surname>
          </string-name>
          .
          <article-title>Robust plsa performs better than lda</article-title>
          .
          <source>In ECIR</source>
          , pages
          <fpage>784</fpage>
          -
          <lpage>787</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <surname>David</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>Andrew Y.</given-names>
          </string-name>
          <string-name>
            <surname>Ng</surname>
            , and
            <given-names>Michael I.</given-names>
          </string-name>
          <string-name>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>March 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hanna</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wallach</surname>
            , Iain Murray, Ruslan Salakhutdinov, and
            <given-names>David</given-names>
          </string-name>
          <string-name>
            <surname>Mimno</surname>
          </string-name>
          .
          <article-title>Evaluation methods for topic models</article-title>
          .
          <source>In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09</source>
          , pages
          <fpage>1105</fpage>
          -
          <lpage>1112</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>