<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gopi Chand Nutakki</string-name>
          <email>g0nuta01@louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Behnoush Abdollahi</string-name>
          <email>b0abdo03@louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olfa Nasraoui</string-name>
          <email>olfa.nasraoui@louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahsa Badami</string-name>
          <email>m0bada01@louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenlong Sun</string-name>
          <email>w0sun005@louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Discovery &amp; Web Mining Lab, University of Louisville</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>We describe the methodology that we followed
to automatically extract topics corresponding
to known events provided by the SNOW 2014
challenge in the context of the SocialSensor
project. A data crawling tool and selected
filtering terms were provided to all the teams.
The crawled data was to be divided into 96
(15-minute) timeslots spanning a 24-hour
period and participants were asked to produce a
fixed number of topics for the selected
timeslots. Our preliminary results are obtained
using a methodology that pulls strengths from
several machine learning techniques, including
Latent Dirichlet Allocation (LDA) for topic
modeling and Non-negative Matrix
Factorization (NMF) for automated hashtag
annotation and for mapping the topics into a latent
space where they become less fragmented and
can be better related with one another. In
addition, we obtain improved topic quality when
sentiment detection is performed to partition
the tweets based on polarity, prior to topic
modeling.
</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The SNOW 2014 challenge was organized within the
context of the SocialSensor project (http://www.socialsensor.eu/), which works on
developing a new framework for enabling real-time
multimedia indexing and search in the Social Web.
The aim of the challenge was to automatically extract
topics corresponding to known events that were
prescribed by the challenge organizers. Also provided,
was a data crawling tool along with several Twitter
filter terms (syria, ukraine, bitcoin, terror). The crawled
data was to be divided into a total of 96 (15-minute)
timeslots spanning a 24-hour period, with a goal of
extracting a fixed number of topics in each timeslot.
Only tweets up to the end of the timeslot could be
used to extract any topic. In this paper, we focus on
the topic extraction task, instead of input data
filtering, or presentation of associated headline, tweets and
image URL, because this was one of the activities
closest to the ongoing research [AN12, HN12, CBGN12]
on multi-domain data stream clustering in the
Knowledge Discovery &amp; Web Mining Lab at the University of
Louisville.</p>
      <sec id="sec-2-1">
        <title>Latent Dirichlet Allocation (LDA)</title>
        <p>To extract topics from the tweets crawled
in each time slot, we use a Latent Dirichlet Allocation
(LDA) based technique. We then discover latent
concepts using Non-negative Matrix Factorization (NMF)
on the resulting topics, and apply hierarchical
clustering within the resulting Latent Space (LS) in order to
agglomerate these topics into less fragmented themes
that can facilitate the visual inspection of how the
different topics are inter-related. We have also
experimented with adding a sentiment detection step prior
to topic modeling in order to obtain a polarity sensitive
topic discovery, and automated hashtag annotation to
improve the topic extraction.
Latent Dirichlet Allocation (LDA) is a Bayesian
probabilistic model for text documents. It assumes a
collection of K topics where each topic defines a
multinomial over the vocabulary, which is assumed to have
been drawn from a Dirichlet process [BNJ03][HBB10].</p>
        <p>Given the topics, LDA assumes the generative process
for each document d, shown in Algorithm 1, where the
notation is listed in Table 1. Equation 1 gives the joint
distribution of a topic mixture θ, a set of N topics z,
and a set of N words w for parameters α and β.</p>
        <p>p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)    (1)</p>
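<p>As a concrete illustration, the generative process behind Equation 1 can be sketched in a few lines of Python (numpy); the vocabulary size, topic count, and hyper-parameter values below are hypothetical:</p>

```python
# Sketch of the LDA generative process (Algorithm 1 / Equation 1),
# with a hypothetical vocabulary of W words and K topics.
import numpy as np

rng = np.random.default_rng(0)
K, W, N = 3, 10, 8           # topics, vocabulary size, words per document
alpha, eta = 0.1, 0.01       # symmetric hyper-parameters (hypothetical values)

# Topic-word distributions beta_k, each drawn from a Dirichlet.
beta = rng.dirichlet(np.full(W, eta), size=K)

def generate_document(n_words=N):
    theta = rng.dirichlet(np.full(K, alpha))   # 1. theta_d ~ Dir(alpha)
    words = []
    for _ in range(n_words):                   # 2. for each word i
        z = rng.choice(K, p=theta)             # 3. z_di ~ theta_d
        w = rng.choice(W, p=beta[z])           # 4. w_di ~ beta_{z_di}
        words.append(int(w))
    return words

doc = generate_document()
```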
        <p>Integrating over θ and summing over z, we obtain
the marginal distribution of a document [BNJ03]:</p>
        <p>p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ</p>
        <p>Taking the product of the marginal probabilities of
single documents, the probability of a corpus D can
be obtained:</p>
        <p>p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d</p>
        <p>Algorithm 1 Latent Dirichlet Allocation.
Input: A document collection, hyper-parameters α and β.
Output: A list of topics.
1. Draw a distribution over topics, θd ∼ Dir(α)
2. For each word i in the document:
3.   Draw a topic index zdi ∈ {1, · · · , K} from the topic weights, zdi ∼ θd
4.   Draw the observed word wdi from the selected topic, wdi ∼ βzdi</p>
        <p>The posterior is usually approximated using Markov
Chain Monte Carlo (MCMC) methods or variational
inference. Both methods are effective, but face
significant computational challenges in the face of massive
data sets. For this reason, we concentrated on a
distributed version of LDA, which is summarized in the
next section.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Distributed Algorithms for LDA</title>
        <p>It is possible to distribute non-collapsed Gibbs
sampling, because sampling of zdi can happen
independently given θd and φk, and thus can be done
concurrently. In a non-collapsed Gibbs sampler, one samples
zdi given θd and φk, and then θd and φk given zdi. If
individual documents are not spread across different
processors, one can marginalize over just θd, since θd is
processor-specific. In this partially collapsed scheme,
the latent variables zdi on each processor can be
concurrently sampled, where the concurrency is over
processors. The slow convergence of partially collapsed
and non-collapsed Gibbs samplers (due to the strong
dependencies between the parameters and latent
variables) has led to devising distributed algorithms for
fully collapsed Gibbs samplers [NASW09][YMM09].</p>
        <p>Given M documents and P processors, with
approximately MP = M/P documents on each
processor p, the M documents are partitioned into x =
{x1, · · · , xp, · · · , xP }, with z = {z1, · · · , zp, · · · , zP }
being the corresponding topic assignments, where
processor p stores xp, the words from documents j =
(p − 1)MP + 1, · · · , pMP, and zp, the corresponding
topic assignments. Topic-document counts Ndk are
likewise distributed as Ndkp. The word-topic counts
Nwk are also distributed, with each processor p
keeping a separate local copy Nwkp.</p>
        <p>Algorithm 2 Standard Collapsed Gibbs Sampling.
LDAGibbsItr(xp, zp, Ndkp, Nwkp, α, β):
1. For each document d on processor p
2.   For each distinct word i in d: v ← xdpi, with occurrence count Tdpi
3.     For each occurrence j ∈ {1, · · · , Tdpi}
4.       k ← zdpij; Ndkp ← Ndkp − 1, Nvkp ← Nvkp − 1
5.       For k = 1 to K: ρk ← ρk−1 + (Ndkp + α) × (Nvkp + β) / (Σw′ Nw′kp + W β)
6.       x ∼ UniformDistribution(0, ρK); k̂ ← BinarySearch(k̂ : ρk̂−1 &lt; x &lt; ρk̂)
7.       Ndk̂p ← Ndk̂p + 1, Nvk̂p ← Nvk̂p + 1; zdpij ← k̂
(W denotes the vocabulary size.)</p>
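<p>A minimal, single-machine sketch of the collapsed Gibbs sweep in Algorithm 2, iterating per token rather than per distinct word; the toy corpus and all sizes are hypothetical:</p>

```python
# One sweep of collapsed Gibbs sampling (cf. Algorithm 2) over a toy
# corpus; numpy only, with hypothetical corpus sizes.
import numpy as np

rng = np.random.default_rng(1)
K, W = 4, 12
docs = [rng.integers(0, W, size=20).tolist() for _ in range(5)]
alpha, beta = 0.1, 0.01

# Random initial topic assignments and the derived count tables.
z = [[int(rng.integers(0, K)) for _ in d] for d in docs]
Ndk = np.zeros((len(docs), K))
Nwk = np.zeros((W, K))
Nk = np.zeros(K)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        Ndk[d, k] += 1; Nwk[w, k] += 1; Nk[k] += 1

def gibbs_sweep():
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            Ndk[d, k] -= 1; Nwk[w, k] -= 1; Nk[k] -= 1   # remove current token
            # Unnormalized conditional p(z = k | rest), cf. step 5.
            p = (Ndk[d] + alpha) * (Nwk[w] + beta) / (Nk + W * beta)
            k = int(rng.choice(K, p=p / p.sum()))        # draw a new topic
            z[d][i] = k
            Ndk[d, k] += 1; Nwk[w, k] += 1; Nk[k] += 1   # add it back

gibbs_sweep()
```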
        <p>Although Gibbs sampling is a sequential process,
given the typically large number of word tokens
compared to the number of processors, the dependence of
zij on the update of any other topic assignment zi′j′ is
likely to be weak, thus relaxing the sequential sampling
constraint. If two processors are concurrently
sampling, but with different words in different documents,
then concurrent sampling will approximate sequential
sampling. This is because the only term affecting the
order of the update operations is the total word-topic
count Σw Nwk. Algorithm 3 shows the pseudocode
of the AD-LDA algorithm, which can terminate after
a fixed number of iterations or based on a suitable
MCMC convergence metric.</p>
        <sec id="sec-2-1-1">
          <title>Algorithm 3: Approximate Distributed LDA (AD-LDA)</title>
          <p>Input: a list of M documents, x = {x1, · · · , xp, · · · , xP }
Output: z = {z1, · · · , zp, · · · , zP }
1. Repeat
2.   For each processor p in parallel do
3.     Copy global counts: Nwkp ← Nwk
4.     Sample zp locally: LDAGibbsItr(xp, zp, Ndkp, Nwkp, α, β) // Alg. 2
5.   Synchronize
6.   Update global counts: Nwk ← Nwk + Σp (Nwkp − Nwk)
7. Until termination criterion is satisfied</p>
          <p>The AD-LDA algorithm
samples from an approximation to the posterior
distribution by allowing different processors to concurrently
sample topic assignments on their local subsets of the
data. AD-LDA works well empirically and accelerates
the topic modeling process.</p>
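<p>The count-synchronization step of Algorithm 3 can be simulated with numpy, with local perturbations standing in for each processor's LDAGibbsItr call; all sizes are hypothetical:</p>

```python
# AD-LDA count-merging step, simulated with P "processors" holding local
# copies of the word-topic counts; numpy only, toy sizes.
import numpy as np

rng = np.random.default_rng(2)
W, K, P = 6, 3, 4
Nwk = rng.integers(0, 10, size=(W, K)).astype(float)   # global counts

# Each processor copies the global table, then samples locally, which
# perturbs its copy by some delta (stand-in for LDAGibbsItr).
local = [Nwk.copy() for _ in range(P)]
for p in range(P):
    delta = rng.integers(-1, 2, size=(W, K))
    local[p] = np.maximum(local[p] + delta, 0)

# Synchronize: Nwk <- Nwk + sum_p (Nwkp - Nwk)
Nwk_new = Nwk + sum(local[p] - Nwk for p in range(P))
```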
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Topic Extraction Methodology</title>
      <sec id="sec-3-1">
        <title>Data Preprocessing</title>
        <p>The dataset consists of tweets that were acquired from
the Twitter servers by continuous querying using a
wrapper for the Twitter API over a period of 24 hours.
The batch of tweets is acquired in raw JSON format.
Various properties of the tweet such as the hashtags,
URLs, creation time, counts for retweets and favorites,
and other user information including the encoding and
language are extracted. The hashtags can provide a
good source for creating discriminating features and
they were folded as terms into the bag of words model
for each tweet where they were present (without the
’#’ prefix). The URLs can also later provide a method
to achieve topic summarization.
</p>
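<p>A small sketch of this preprocessing step, assuming a tweet dict with hypothetical "text" and "hashtags" fields extracted from the raw JSON:</p>

```python
# Folding hashtags into the bag of words, as in the preprocessing step;
# the tweet dict mimics a few fields of the raw Twitter JSON (hypothetical).
import re

def tweet_to_bag_of_words(tweet):
    text = tweet["text"].lower()
    # Keep only English letters; stop words are retained for context.
    tokens = re.findall(r"[a-z']+", text)
    # Hashtags are folded in as plain terms, without the '#' prefix.
    hashtags = [h.lower() for h in tweet.get("hashtags", [])]
    return tokens + [h for h in hashtags if h not in tokens]

tweet = {"text": "Bitcoin value is accelerating!", "hashtags": ["Bitcoin", "BTC"]}
bag = tweet_to_bag_of_words(tweet)
```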
      </sec>
      <sec id="sec-3-2">
        <title>Topic Extraction Stages</title>
        <p>The technique assumes a real time streaming data
input and is replicated using process calls to the
storage records containing the tweets. For AD-LDA, each
2JSON: JavaScript Object Notation, is a text-based open
standard designed for human-readable data interchange
tweet is considered as a single document. Figure 1
shows the steps performed to extract the topics in each
window or time slot. The procedure starts with the
extraction of key information from the Twitter JSON,
then the tweet text and other properties are used to
extract topics. The topic extraction is performed using
the following steps:
1. The documents are stripped of non-English
characters and are converted to lowercase. The stop
words are retained for the context information
(especially for sentiment detection).
2. Groups of documents are assembled into windows
based on their timestamp. A sliding window’s
width is equal to three consecutive time slots
ending in the current time slot.
3. The AD-LDA technique is performed on a 20-node
cluster. From each sliding window iteration, a
total of 1000 topics is extracted; this higher value
helps in extracting finer topics.
4. The topics are ranked based on the proportion of
tweets assigned to the topic in the given window,
and then can be clustered/merged together to
organize them into more general topic groups.
The jsoup open source HTML parser (http://jsoup.org/) was used to
extract multimedia content such as images and metadata
from the URLs extracted from the tweets. The
headlines are part of the metadata, while the keywords are
obtained from the topic modeling itself as the terms
with highest probability in the topic.</p>
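<p>Step 2 above (sliding windows of three consecutive time slots) can be sketched as:</p>

```python
# Assembling tweets into sliding windows of three consecutive 15-minute
# time slots ending in the current slot (step 2 of the extraction stages).
def window_slots(current_slot, width=3):
    """Return the slot indices covered by the sliding window."""
    start = max(0, current_slot - width + 1)
    return list(range(start, current_slot + 1))

def window_tweets(tweets_by_slot, current_slot, width=3):
    # Only tweets up to the end of the current slot are used.
    return [t for s in window_slots(current_slot, width)
              for t in tweets_by_slot.get(s, [])]

# Hypothetical slot -> tweets mapping:
slots = {0: ["t0"], 1: ["t1a", "t1b"], 2: ["t2"], 3: ["t3"]}
```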
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Topic Extraction with AD-LDA and Sentiment Labels</title>
      <p>The AD-LDA technique with Gibbs sampling, along
with automatically extracted sentiment labels can also
be used to extract polarity-sensitive topics. Using
sentiment labels may improve the quality of the topics as
it results in finer topics. Figure 1 depicts the general
flow within the used methodology. A weighted Naive
Bayes classifier [LGD11], trained with labeled tweet
samples (https://github.com/ravikiranj/twitter-sentiment-analyzer)
and a set of labeled tokens (https://github.com/linron84/JST) with known
sentiment polarity, can be used to extract the sentiment
levels. The tweets are then regrouped based on the
sentiment level and the topic modeling is applied to
each group, resulting in topics that are confined to one
sentiment, as illustrated in Table 2.</p>
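<p>A minimal sketch of the partition-then-model flow; the lexicon-based polarity scorer below is a hypothetical stand-in for the weighted Naive Bayes classifier of [LGD11]:</p>

```python
# Partitioning tweets by predicted polarity before topic modeling; the
# classifier here is a stub scoring against small seed lexicons (a
# stand-in for the weighted Naive Bayes classifier of the paper).
POSITIVE = {"optimistic", "future", "value"}
NEGATIVE = {"horrible", "crisis", "unrest"}

def polarity(tokens):
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"

def partition(tweets):
    groups = {"positive": [], "negative": []}
    for tokens in tweets:
        groups[polarity(tokens)].append(tokens)
    return groups   # topic modeling is then run on each group separately

tweets = [["optimistic", "ukraine"], ["horrible", "crisis"], ["bitcoin", "value"]]
groups = partition(tweets)
```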
    </sec>
    <sec id="sec-5">
      <title>Topic Agglomeration in a Latent Space</title>
      <sec id="sec-5-1">
        <title>Discovering Latent Factors Among the Discovered Topics Using Non-negative Matrix Factorization (NMF)</title>
        <p>Because the initial topic modeling generated a high
number of topics (1000 topics per window) that were,
furthermore, very sparse in terms of the descriptive terms
within them, these topics were hard to interpret and
could benefit from a coarser, less fragmented
organization. One way to fix this problem was to merge
the topics based on conceptual similarity by
applying Non-negative Matrix Factorization (NMF) [LS99].
Because the topics-by-words matrix is very sparse,
we used NMF to project the topics onto a
common lower-dimensional latent factor space. NMF
takes as input the matrix X of n topics by m words
(as binary features) and decomposes it into two
factor matrices (A and B) which represent the topics and
words, respectively, in a kf -dimensional latent space,
as follows:</p>
        <p>X_{n×m} ≈ A_{n×k_f} B^T_{m×k_f}    (2)</p>
      </sec>
      <sec id="sec-7-4">
        <title>Positive Sentiment</title>
        <p>optimistic ukraine antiwar nonintervention
syria refugees about education children million
future technology bitcoins value law accelerating</p>
      </sec>
      <sec id="sec-7-5">
        <title>Negative Sentiment</title>
        <p>horrible building badge hiding ukraine yanukovych
syria yarmouk camp crisis food waiting unrest shocking
cnn protocols loss gox bitcoin fault</p>
        <p>where kf is the approximated rank of matrices A
and B, and is selected such that kf &lt; min(m, n), so
that the number of elements in the decomposition
matrices is far less than the number of elements of the
original matrix: nkf + kf m nm.</p>
        <p>The topics factor (A) can then be used to find the
similarity between the topics in the new latent space
instead of the original term space. The
obtained similarity matrix from the NMF factors can
finally be used to cluster the topics.</p>
        <p>To find A and B, the Frobenius norm of the error
between the data and the approximation is minimized, as
follows:</p>
        <p>J_NMF = ||E||_F^2 = ||X − AB^T||_F^2    (3)</p>
        <p>Several algorithms have been proposed in the
literature to minimize this cost. We used an Alternating
Least Squares (ALS) method [PT94] that iteratively
solves for the factors, by assuming that the problem is
convex in either one of the factor matrices alone.</p>
      </sec>
      <sec id="sec-7-6">
        <title>ALS Algorithm for NMF</title>
        <p>Input: data matrix X, number of factors kf
Output: factor matrices A and B
1. Initialize matrix A (for example, randomly)
2. Repeat
(a) Solve for B in the equation A^T A B^T = A^T X
(b) Project the solution onto the non-negative
subspace: set all negative values in B to zero
(c) Solve for A in the equation B^T B A^T = B^T X^T
(d) Project the solution onto the non-negative
subspace: set all negative values in A to zero
3. Until the decrease in the cost function falls below a threshold</p>
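<p>The ALS iterations above can be sketched with numpy least squares; the data matrix and its dimensions are hypothetical:</p>

```python
# ALS for NMF, sketched with numpy least squares; X is a toy binary
# topics-by-words matrix, and k_f is deliberately small.
import numpy as np

rng = np.random.default_rng(3)
n, m, kf = 8, 12, 3
X = (rng.random((n, m)) < 0.3).astype(float)

A = rng.random((n, kf))                      # 1. initialize A randomly
for _ in range(50):                          # 2. repeat
    # (a) solve A^T A B^T = A^T X for B, (b) clip negatives to zero
    B = np.maximum(np.linalg.lstsq(A, X, rcond=None)[0].T, 0)
    # (c) solve B^T B A^T = B^T X^T for A, (d) clip negatives to zero
    A = np.maximum(np.linalg.lstsq(B, X.T, rcond=None)[0].T, 0)

error = np.linalg.norm(X - A @ B.T) ** 2     # Frobenius cost J_NMF
```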
      </sec>
      <sec id="sec-7-7">
        <title>Topic Organization Stages: Topic Feature Extraction, Latent Space Computation Using NMF, Latent Space-based Topic Similarity Computation, and Hierarchical Clustering</title>
        <p>In the following, we summarize the steps that are
applied post-discovery of the topics, in order to generate
a hierarchical organization from the sparse topics.
1. Preprocessing of the topic vectors: For each
window, the topic-word matrix (Xn×m) is extracted
from the final topic modeling results. The
features are the top words in a topic, and they are
binary (1 if a topic has the word in question and
0 otherwise).
2. Latent Factor Discovery using NMF: The
topic-word data was normalized before running NMF.
The latter produces two factors (A and B), where
n, kf and m are the number of topics, latent
factors, and words, respectively. Our main goal was
to compute the matrix A, also called the topics
basis factor, which transfers the topics to the
latent space. Choosing the number of factors, kf,
has an impact on the results. After trial and
error, we chose kf = 30.
3. Generating the topic-similarity matrix in the
latent space: The computed topic basis matrix (A)
was used to obtain the similarity in lieu of the
original topic vectors. The normalized inner
product of the matrix A and its transpose was
calculated for this purpose. Normalization of the
product is equivalent to computing the Cosine
similarity between topic pairs. The resulting matrix
contains the pairwise similarity between each pair
of topics within the latent space.
4. Hierarchical Clustering of the latent
space-projected topics based on the new pairwise
similarity scores computed in Step 3: we experimented
with several linkage strategies, such as single and
average linkage. The latter was chosen as optimal.</p>
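<p>Step 3 above (cosine similarity from the topic basis factor) can be sketched as follows, with a random stand-in for the NMF factor A:</p>

```python
# Pairwise topic similarity as the normalized inner product of the topic
# basis factor A with itself (cosine similarity in the latent space);
# A here is a small random stand-in for the NMF output.
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((5, 3))                   # 5 topics in a k_f = 3 latent space

norms = np.linalg.norm(A, axis=1, keepdims=True)
S = (A / norms) @ (A / norms).T          # S[i, j] = cos(topic_i, topic_j)
```

The similarity matrix S then feeds hierarchical clustering (Step 4) in place of distances over the original sparse term vectors.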
      </sec>
      <sec id="sec-7-9">
        <title>Automated Hashtag Annotation</title>
        <p>We have also experimented with a simple tag
completion or prediction step prior to topic modeling.
Annotation for a given tweet is determined by finding
the top frequent tags associated with the KLS
nearest neighboring tweets in the NMF-computed Latent
Space to the given tweet (we report results for KLS = 5).
Once the tags are
completed, they are used to enrich the tweets before topic
modeling. Of course, only the tweets' bag-of-words
descriptions in a given window are used to compute the
NMF for that window's topic modeling. The
annotation generally resulted in lower perplexity of the
extracted topic models, as shown in Figure 6.</p>
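<p>A sketch of the tag-completion step, with random stand-ins for the latent-space tweet vectors and hashtag lists:</p>

```python
# Tag completion sketch: annotate a tweet with the most frequent hashtags
# of its K_LS nearest neighbors in the latent space (K_LS = 5, as in the
# paper); the latent vectors are random stand-ins for the NMF projection.
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)
n_tweets, kf = 20, 4
Z = rng.random((n_tweets, kf))                   # latent tweet vectors
tags = [["bitcoin"] if i % 2 == 0 else ["ukraine"] for i in range(n_tweets)]

def annotate(i, k_ls=5, top=1):
    dists = np.linalg.norm(Z - Z[i], axis=1)
    neighbors = np.argsort(dists)[1:k_ls + 1]    # exclude the tweet itself
    counts = Counter(t for j in neighbors for t in tags[j])
    return [t for t, _ in counts.most_common(top)]

predicted = annotate(0)
```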
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Results</title>
      <sec id="sec-8-1">
        <title>Distributed LDA-based Topic Modeling</title>
        <p>Figure 2 shows a sample of the topic clusters'
hierarchy extracted from the initial window and
without NMF-based latent space projection of the
topics. (Refer to the electronic version of the paper for clarity.)
The clusters are of debatable quality.
Perplexity is a common metric to evaluate language
models [BL06][BNJ03]. It is monotonically
decreasing in the likelihood of the test data, with a lower
value indicating a better model. For a held-out set D′ of T
documents, with Nd keywords in the dth document,
the perplexity given in Equation 4 will be lower
for a better topic model.
Figure 5 shows the perplexity trends, suggesting that
more topics result in lower (thus better) perplexity.
Also, irrespective of the number of topics,
AD-LDA-based topic modeling can extract topics of good
quality.</p>
        <p>perplexity(D′) = exp( − (Σ_{d=1}^{T} ln p(w^(d) | α, β)) / (Σ_{d=1}^{T} Nd) )    (4)</p>
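<p>Equation 4 amounts to exponentiating the negative average per-word log-likelihood; a minimal sketch with hypothetical likelihood values:</p>

```python
# Perplexity as in Equation 4: exponentiated negative average per-word
# log-likelihood over the held-out documents (toy likelihood values).
import math

def perplexity(log_likelihoods, doc_lengths):
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

# Two hypothetical held-out documents of 40 and 25 keywords:
pp = perplexity([-120.0, -80.0], [40, 25])
```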
      </sec>
      <sec id="sec-8-2">
        <title>Sentiment Based Topic Modeling</title>
        <p>Table 2 shows a subset of topics extracted from the
positive and negative sentiment groups of tweets;
these tend to be more refined than the standard
sentiment-agnostic topics. From the initial window, 1000
topics were extracted in the same way as with Distributed
LDA; however, topic modeling was preceded by a
sentiment classifier that classifies the tweets based on their
sentiment (positive or negative). Although positive
and negative topics still share a few keywords, they
are clearly divided by sentiment.</p>
      </sec>
      <sec id="sec-8-3">
        <title>Topic Clustering in the Latent Space</title>
        <p>Figure 3 shows the topic clusters created using the
latent space-projected features extracted using NMF.
The clusters in Figure 3 seem to have better quality
compared to the clusters in Figure 2 because of the
more accurate capture of pairwise similarities between
topics in the conceptual space. Figure 4 shows the
clustering of the top 10 topics for a series of 6 windows,
showing how the agglomeration can consolidate the
topics discovered at different time slots, helping avoid
excessive fragmentation throughout the stream’s life.
</p>
      </sec>
      <sec id="sec-8-4">
        <title>Automated Hashtag Annotation</title>
        <p>Tweet data is very sparse and not every tweet has
valuable tags. To overcome this weakness, we applied an
NMF-based automated tweet annotation before topic
modeling. Adding the predicted hashtags to the tweets
enhanced the topic modeling. The automated tag
annotation, described in Section 5.3, generally resulted
in lower Perplexity of the extracted topic models, as
shown in Figure 6, suggesting that the auto-completed
tags did help complete some missing and valuable
information in the sparse tweet data, thus helping the
topic modeling.</p>
      </sec>
      <sec id="sec-8-5">
        <title>Conclusions and Future Directions</title>
        <p>Using Distributed LDA topic modeling, followed by
NMF and hierarchical clustering within the resulting
Latent Space (LS), helped organize the topics into less
fragmented themes. Sentiment detection prior to topic
modeling and automated hashtag annotation helped
improve the learned topic models, while the
agglomeration of topics across several time windows can link
the topics discovered at different time windows. Our
focus was on the topic modeling and organization
using the simplest (bag of words) features.
Specialized Twitter feature extraction and selection methods,
such as the ones surveyed and proposed by Aiello et
al. [APM+13], have the potential to improve our
results, a direction we will explore in the future.
Another direction to explore is the news domain-specific,
user-centered approach discussed in [SNT+14], and a
more expanded use of automated annotation to
support topic extraction and description.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>We would like to thank the organizers of the SNOW
2014 workshop, in particular the members of the
SocialSensor team for their leadership in all the phases
of the competition.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[BL06] David Blei and John Lafferty. Correlated topic models. Advances in Neural Information Processing Systems, 18:147, 2006.</p>
      <p>[BM10] David M. Blei and Jon D. McAuliffe. Supervised topic models. arXiv preprint arXiv:1003.0783, 2010.</p>
      <p>[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.</p>
      <p>[CBGN12] Juan C. Caicedo, Jaafar BenAbdallah, Fabio A. González, and Olfa Nasraoui. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing, 76(1):50–60, 2012.</p>
      <p>[HBB10] Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856–864, 2010.</p>
      <p>[HN12] Basheer Hawwash and Olfa Nasraoui. Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 109–117. ACM, 2012.</p>
      <p>[LGD11] Chang-Hwan Lee, Fernando Gutierrez, and Dejing Dou. Calculating feature weights in naive Bayes with Kullback-Leibler measure. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 1146–1151. IEEE, 2011.</p>
      <p>[LH09] Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375–384. ACM, 2009.</p>
      <p>[LHAY07] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. ARSA: a sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 607–614. ACM, 2007.</p>
      <p>[LS99] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.</p>
      <p>[McC] MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/mccallum/mallet.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AN12]
          <string-name>
            <given-names>Artur</given-names>
            <surname>Abdullin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Olfa</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          .
          <article-title>Clustering heterogeneous data sets</article-title>
          .
          <source>In Web Congress (LA-WEB)</source>
          ,
          <source>2012 Eighth Latin American</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [APM+13]
          <string-name>
            <given-names>Luca Maria</given-names>
            <surname>Aiello</surname>
          </string-name>
          , Georgios Petkos, Carlos Martin,
          <string-name>
            <given-names>David</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Symeon</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , Ryan Skraba, Ayse Goker, Ioannis Kompatsiaris, and
          <string-name>
            <given-names>Alejandro</given-names>
            <surname>Jaimes</surname>
          </string-name>
          .
          <article-title>Sensing trending topics in twitter</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [LZ08]
          <string-name>
            <given-names>Yue</given-names>
            <surname>Lu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Opinion integration through semi-supervised topic modeling</article-title>
          .
          <source>In Proceedings of the 17th international conference on World wide web</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [NASW09]
          <string-name>
            <given-names>David</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Arthur</given-names>
            <surname>Asuncion</surname>
          </string-name>
          , Padhraic Smyth, and
          <string-name>
            <given-names>Max</given-names>
            <surname>Welling</surname>
          </string-name>
          .
          <article-title>Distributed algorithms for topic models</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          :
          <fpage>1801</fpage>
          -
          <lpage>1828</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [PCA14]
          <string-name>
            <given-names>Symeon</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , David Corney, and Luca Maria Aiello.
          <article-title>Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media</article-title>
          .
          <source>In Proceedings of the SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [PT94]
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Paatero</surname>
          </string-name>
          and
          <string-name>
            <given-names>Unto</given-names>
            <surname>Tapper</surname>
          </string-name>
          .
          <article-title>Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</article-title>
          .
          <source>Environmetrics</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [SNT+14]
          <string-name>
            <given-names>S</given-names>
            <surname>Schifferes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thurman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Goker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Identifying and verifying news through social media: Developing a user-centered tool for professional journalists</article-title>
          .
          <source>Digital Journalism</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [TM08]
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Titov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ryan</given-names>
            <surname>McDonald</surname>
          </string-name>
          .
          <article-title>Modeling online reviews with multi-grain topic models</article-title>
          .
          <source>In Proceedings of the 17th international conference on World Wide Web</source>
          , pages
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [WWC05]
          <string-name>
            <given-names>Janyce</given-names>
            <surname>Wiebe</surname>
          </string-name>
          , Theresa Wilson, and
          <string-name>
            <given-names>Claire</given-names>
            <surname>Cardie</surname>
          </string-name>
          .
          <article-title>Annotating expressions of opinions and emotions in language</article-title>
          .
          <source>Language resources and evaluation</source>
          ,
          <volume>39</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>165</fpage>
          -
          <lpage>210</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [YMM09]
          <string-name>
            <given-names>Limin</given-names>
            <surname>Yao</surname>
          </string-name>
          , David Mimno, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Efficient methods for topic model inference on streaming document collections</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>937</fpage>
          -
          <lpage>946</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [ZBG13]
          Ke Zhai and Jordan Boyd-Graber.
          <article-title>Online topic models with infinite vocabulary</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>