Distributed LDA-based Topic Modeling and Topic Agglomeration in a Latent Space

Gopi Chand Nutakki
Knowledge Discovery & Web Mining Lab
University of Louisville
g0nuta01@louisville.edu

Olfa Nasraoui
Knowledge Discovery & Web Mining Lab
University of Louisville
olfa.nasraoui@louisville.edu

Behnoush Abdollahi
Knowledge Discovery & Web Mining Lab
University of Louisville
b0abdo03@louisville.edu

Mahsa Badami
Knowledge Discovery & Web Mining Lab
University of Louisville
m0bada01@louisville.edu

Wenlong Sun
Knowledge Discovery & Web Mining Lab
University of Louisville
w0sun005@louisville.edu



Abstract

We describe the methodology that we followed to automatically extract topics corresponding to known events provided by the SNOW 2014 challenge in the context of the SocialSensor project. A data crawling tool and selected filtering terms were provided to all the teams. The crawled data was to be divided into 96 (15-minute) timeslots spanning a 24-hour period, and participants were asked to produce a fixed number of topics for the selected timeslots. Our preliminary results are obtained using a methodology that pulls strengths from several machine learning techniques, including Latent Dirichlet Allocation (LDA) for topic modeling and Non-negative Matrix Factorization (NMF) for automated hashtag annotation and for mapping the topics into a latent space where they become less fragmented and can be better related with one another. In addition, we obtain improved topic quality when sentiment detection is performed to partition the tweets based on polarity, prior to topic modeling.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1 SocialSensor: http://www.socialsensor.eu/

Table 1: Description of used variables.

  Symbol   Description
  M        Number of documents in the collection
  W        Number of distinct words in the vocabulary
  N        Total number of words in the collection
  K        Number of topics
  x_di     ith observed word in document d
  z_di     Topic assigned to x_di
  N_wk     Count of word w assigned to topic k
  N_dk     Count of topic k assigned in document d
  φ_k      Probability of a word given topic k
  θ_d      Probability of a topic given document d
  α, β     Dirichlet priors

Figure 1: Topic Modeling Framework (sentiment detection and hashtag annotation are not shown).

1 Introduction

The SNOW 2014 challenge was organized within the context of the SocialSensor project1, which works on developing a new framework for enabling real-time multimedia indexing and search in the Social Web. The aim of the challenge was to automatically extract topics corresponding to known events that were prescribed by the challenge organizers. Also provided was a data crawling tool, along with several Twitter filter terms (syria, ukraine, bitcoin, terror). The crawled data was to be divided into a total of 96 (15-minute) timeslots spanning a 24-hour period, with a goal of extracting a fixed number of topics in each timeslot. Only tweets up to the end of the timeslot could be used to extract any topic. In this paper, we focus on the topic extraction task, instead of input data filtering or the presentation of the associated headline, tweets, and image URL, because this was one of the activities closest to the ongoing research [AN12, HN12, CBGN12] on multi-domain data stream clustering in the Knowledge Discovery & Web Mining Lab at the University of Louisville. To extract topics from the tweets crawled
in each time slot, we use a Latent Dirichlet Allocation (LDA) based technique. We then discover latent concepts using Non-negative Matrix Factorization (NMF) on the resulting topics, and apply hierarchical clustering within the resulting Latent Space (LS) in order to agglomerate these topics into less fragmented themes that can facilitate the visual inspection of how the different topics are inter-related. We have also experimented with adding a sentiment detection step prior to topic modeling in order to obtain a polarity-sensitive topic discovery, and with automated hashtag annotation to improve the topic extraction.

2 Background

2.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic model for text documents. It assumes a collection of K topics, where each topic defines a multinomial over the vocabulary, which is assumed to have been drawn from a Dirichlet distribution [BNJ03][HBB10]. Given the topics, LDA assumes the generative process for each document d shown in Algorithm 1, where the notation is listed in Table 1. Equation 1 gives the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w for parameters α and β.

Algorithm 1 Latent Dirichlet Allocation.
Input: A document collection, hyper-parameters α and β.
Output: A list of topics.
  1. Draw a distribution over topics, θ_d ∼ Dir(α)
  2. For each word i in the document:
  3.   Draw a topic index z_di ∈ {1, ..., K} from the topic weights, z_di ∼ θ_d
  4.   Draw the observed word w_di from the selected topic, w_di ∼ β_{z_di}

    p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)        (1)

Integrating over θ and summing over z, we obtain the marginal distribution of a document [BNJ03]:

    p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

Taking the product of the marginal probabilities of single documents, the probability of a corpus D can be obtained:

    p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d

The posterior is usually approximated using Markov Chain Monte Carlo (MCMC) methods or variational inference. Both methods are effective, but face significant computational challenges in the face of massive data sets. For this reason, we concentrated on a distributed version of LDA, which is summarized in the next section.

2.2 Distributed Algorithms for LDA

It is possible to distribute non-collapsed Gibbs sampling, because the sampling of z_di can happen independently given θ_d and φ_k, and thus can be done concurrently. In a non-collapsed Gibbs sampler, one samples z_di given θ_d and φ_k, and then θ_d and φ_k given z_di. If individual documents are not spread across different processors, one can marginalize over just θ_d, since θ_d is processor-specific. In this partially collapsed scheme, the latent variables z_di on each processor can be concurrently sampled, where the concurrency is over processors. The slow convergence of partially collapsed and non-collapsed Gibbs samplers (due to the strong dependencies between the parameters and latent variables) has led to devising distributed algorithms for fully collapsed Gibbs samplers [NASW09][YMM09].

Given M documents and P processors, with approximately M_P = M/P documents distributed on each processor p, the M documents are partitioned into x = {x_1, ..., x_p, ..., x_P}, with z = {z_1, ..., z_p, ..., z_P} being the corresponding topic assignments, where processor p stores x_p, the words from documents j = (p − 1)M_P + 1, ..., pM_P, and z_p, the corresponding topic assignments. Topic-document counts N_dk are likewise distributed as N_dkp. The word-topic counts N_wk are also distributed, with each processor p keeping a separate local copy N_wkp.

Algorithm 2 Standard Collapsed Gibbs Sampling.
LDAGibbsItr(x_p, z_p, N_dkp, N_wkp, α, β):
  1. For each d ∈ {1, ..., M}
  2.   For each i ∈ {1, ..., N_dkp}
  3.     v ← x_dpi, T_dpi ← N_dkpi
  4.     For each j ∈ {1, ..., T_dkpi}
  5.       k̂ ← z_dpij
  6.       N_dkp ← N_dkp − 1, N_wkp ← N_wkp − 1
  7.       For k = 1 to K
  8.         ρ_k ← ρ_{k−1} + (N_dkp + α)(N_wkp + β) / (Σ_{w′} N_{w′kp} + Wβ)
  9.       x ∼ UniformDistribution(0, ρ_K)
 10.       k̂ ← BinarySearch(k̂ : ρ_{k̂−1} < x < ρ_{k̂})
 11.       N_dk̂p ← N_dk̂p + 1, N_wk̂p ← N_wk̂p + 1
 12.       z_dpij ← k̂

Although Gibbs sampling is a sequential process, given the typically large number of word tokens compared to the number of processors, the dependence of z_ij on the update of any other topic assignment z_i′j′ is likely to be weak, thus relaxing the sequential sampling constraint. If two processors are concurrently sampling, but with different words in different documents, then concurrent sampling will approximate sequential sampling. This is because the only term affecting the order of the update operations is the total word-topic count Σ_w N_wk. Algorithm 3 shows the pseudocode of the AD-LDA algorithm, which can terminate after a fixed number of iterations or based on a suitable MCMC convergence metric. The AD-LDA algorithm samples from an approximation to the posterior distribution by allowing different processors to concurrently sample topic assignments on their local subsets of the data. AD-LDA works well empirically and accelerates the topic modeling process.

Algorithm 3 Approximate Distributed LDA [NASW09].
Input: A list of M documents, x = {x_1, ..., x_p, ..., x_P}
Output: z = {z_1, ..., z_p, ..., z_P}
  1. Repeat
  2.   For each processor p in parallel do
  3.     Copy global counts: N_wkp ← N_wk
  4.     Sample z_p locally: LDAGibbsItr(x_p, z_p, N_dkp, N_wkp, α, β)  // Alg. 2
  5.   Synchronize
  6.   Update global counts: N_wk ← N_wk + Σ_p (N_wkp − N_wk)
  7. Until termination criterion is satisfied

Figure 2: Dendrogram depicting a few clusters' hierarchy of topics from the initial window. Agglomeration is based on the cosine similarity. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as (1 − similarity). Refer to the electronic version of the paper for clarity.

Figure 3: Portion of a dendrogram depicting the clusters' hierarchy of topics from the initial window (Number 0). Agglomeration is based on the dot product between the topics' projections on a lower-dimensional latent space extracted using NMF with kf = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as (1 − similarity). Refer to the electronic version of the paper for clarity.

3 Topic Extraction Methodology

3.1 Data Preprocessing

The dataset consists of tweets that were acquired from the Twitter servers by continuous querying using a wrapper for the Twitter API over a period of 24 hours. The batches of tweets are acquired in raw JSON2 format. Various properties of each tweet, such as the hashtags, URLs, creation time, counts of retweets and favorites, and other user information including the encoding and language, are extracted. The hashtags can provide a good source for creating discriminating features, and they were folded as terms into the bag-of-words model for each tweet where they were present (without the '#' prefix). The URLs can also later provide a method to achieve topic summarization.

2 JSON: JavaScript Object Notation, a text-based open standard designed for human-readable data interchange.

3.2 Topic Extraction Stages

The technique assumes a real-time streaming data input, which is replicated using process calls to the storage records containing the tweets. For AD-LDA, each tweet is considered as a single document. Figure 1 shows the steps performed to extract the topics in each window or time slot. The procedure starts with the extraction of key information from the Twitter JSON; then the tweet text and other properties are used to extract topics. The topic extraction is performed using the following steps:
1. The documents are stripped of non-English characters and converted to lowercase. The stop words are retained for the context information (especially for sentiment detection).

2. Groups of documents are assembled into windows based on their timestamps. A sliding window's width is equal to three consecutive time slots ending in the current time slot.

3. The AD-LDA technique is performed on a 20-node cluster. From each sliding window iteration, a total of 1000 topics are extracted; this higher value helps in extracting finer topics.

4. The topics are ranked based on the proportion of tweets assigned to the topic in the given window, and can then be clustered/merged together to organize them into more general topic groups.

The jsoup open source HTML parser3 was used to extract multimedia content such as images and metadata from the URLs extracted from the tweets. The headlines are part of the metadata, while the keywords are obtained from the topic modeling itself as the terms with the highest probability in the topic.

3 http://jsoup.org/

4 Topic Extraction with AD-LDA and Sentiment Labels

The AD-LDA technique with Gibbs sampling, along with automatically extracted sentiment labels, can also be used to extract polarity-sensitive topics. Using sentiment labels may improve the quality of the topics, as it results in finer topics. Figure 1 depicts the general flow within the used methodology. A weighted Naive Bayes classifier [LGD11], trained with labeled tweet samples4 and a set of labeled tokens5 with known sentiment polarity, can be used to extract the sentiment levels. The tweets are then regrouped based on the sentiment level and topic modeling is applied on each group, resulting in topics that are confined to one sentiment, as illustrated in Table 2.

4 https://github.com/ravikiranj/twitter-sentiment-analyzer
5 https://github.com/linron84/JST

  Positive Sentiment                                  | Negative Sentiment
  optimistic ukraine antiwar nonintervention          | horrible building badge hiding ukraine yanukovych
  syria refugees about education children million     | syria yarmouk camp crisis food waiting unrest shocking
  future technology bitcoins value law accelerating   | cnn protocols loss gox bitcoin fault

Table 2: Illustrating a sample of the finer topics extracted after a preliminary sentiment detection phase.

5 Topic Agglomeration in a Latent Space

5.1 Discovering Latent Factors Among the Discovered Topics Using Non-negative Matrix Factorization (NMF)

Because the initial topic modeling generated a high number of topics (1000 topics per window) that were furthermore very sparse in terms of the descriptive terms within them, these topics were hard to interpret and could benefit from a coarser, less fragmented organization. One way to fix this problem was to merge the topics based on a conceptual similarity by applying Non-negative Matrix Factorization (NMF) [LS99]. Because the topics-by-words matrix is very sparse, we used NMF to project the topics onto a common lower-dimensional latent factors' space. NMF takes as input the matrix X of n topics by m words (as binary features) and decomposes it into two factor matrices (A and B), which represent the topics and words, respectively, in a kf-dimensional latent space, as follows:

    X_{n×m} ≈ A_{n×kf} B^T_{m×kf}        (2)
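To make the factorization in Equation 2 and the alternating least-squares scheme of Algorithm 4 concrete, here is a minimal sketch (our own toy example with random data, not the authors' implementation) that alternates least-squares solves with a projection onto the non-negative orthant:

```python
import numpy as np

def als_nmf(X, kf, iters=200, seed=0):
    """Sketch of ALS-based NMF: X (n x m) is approximated by A (n x kf) @ B.T."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = rng.random((n, kf))  # random non-negative initialization
    B = np.zeros((m, kf))
    for _ in range(iters):
        # Solve the normal equations A^T A B^T = A^T X for B^T, then
        # project onto the non-negative subspace (zero out negatives).
        B = np.linalg.lstsq(A, X, rcond=None)[0].T
        B[B < 0] = 0
        # Solve B B^T A^T = B X^T for A^T, then project likewise.
        A = np.linalg.lstsq(B, X.T, rcond=None)[0].T
        A[A < 0] = 0
    return A, B

# Toy binary "topics by words" matrix (5 topics x 8 words).
X = (np.random.default_rng(1).random((5, 8)) > 0.6).astype(float)
A, B = als_nmf(X, kf=2)
err = np.linalg.norm(X - A @ B.T)  # Frobenius reconstruction error (Eq. 3)
```

In practice, a library routine such as scikit-learn's `NMF` (which also accepts sparse input) would replace this sketch; the rows of A then play the role of the latent-space topic projections used for the similarity computations below.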
where kf is the approximated rank of matrices A and B, and is selected such that kf < min(m, n), so that the number of elements in the decomposition matrices is far less than the number of elements in the original matrix: n·kf + kf·m ≪ nm.

The topics factor (A) can then be used to find the similarity between the topics in the new latent space instead of in the original space of terms. The similarity matrix obtained from the NMF factors can finally be used to cluster the topics.

To find A and B, the Frobenius norm of the errors between the data and the approximation is optimized, as follows:

    J_NMF = ||E||²_F = ||X − AB^T||²_F        (3)

Several algorithms have been proposed in the literature to minimize this cost. We used an Alternating Least Squares (ALS) method [PT94] that iteratively solves for the factors, by assuming that the problem is convex in either one of the factor matrices alone.

Algorithm 4 Basic Alternating Least Squares (ALS) Algorithm for NMF.
Input: Data matrix X, number of factors kf
Output: Optimal matrices A and B
  1. Initialize matrix A (for example, randomly)
  2. Repeat
     (a) Solve for B in the equation: A^T A B^T = A^T X
     (b) Project the solution onto the non-negative matrix subspace: set all negative values in B to zero
     (c) Solve for A in the equation: B B^T A^T = B X^T
     (d) Project the solution onto the non-negative matrix subspace: set all negative values in A to zero
  3. Until the decrease in the cost function falls below a threshold

Figure 4: Portion of a dendrogram depicting the clusters' hierarchy of topics from the first 6 windows. Agglomeration is based on the dot product between the topics' projections on a lower-dimensional latent space extracted using NMF with kf = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as (1 − similarity). Refer to the electronic version of the paper for clarity.

5.2 Topic Organization Stages: Topic Feature Extraction, Latent Space Computation using NMF, Latent Space-based Topic Similarity Computation, and Hierarchical Clustering

In the following, we summarize the steps that are applied post-discovery of the topics, in order to generate a hierarchical organization from the sparse topics.

1. Preprocessing of the topic vectors: For each window, the topic-word matrix (X_{n×m}) is extracted from the final topic modeling results. The features are the top words in a topic, and they are binary (1 if a topic contains the word in question and 0 otherwise).

2. Latent factor discovery using NMF: The topic-word data was normalized before running NMF. The latter produces two factors (A and B), where n, kf and m are the number of topics, latent factors, and words, respectively. Our main goal was to compute the matrix A, also called the topics basis factor, which transfers the topics to the latent space. Choosing the number of factors, kf, has an impact on the results. After trial and error, we chose kf = 30.

3. Generating the topic-similarity matrix in the latent space: The computed topic basis matrix (A) was used to obtain the similarity in lieu of the original topic vectors. The normalized inner product of the matrix A and its transpose was calculated for this purpose. Normalization of the product is equivalent to computing the cosine similarity between topic pairs. The resulting matrix contains the pairwise similarity between each pair of topics within the latent space.

4. Hierarchical clustering of the latent space-projected topics based on the new pairwise similarity scores computed in Step 3: We experimented with several linkage strategies, such as single and average linkage. The latter was chosen as optimal.

5.3 Automated Hashtag Annotation

We have also experimented with a simple tag completion or prediction step prior to topic modeling. The annotation for a given tweet is determined by finding the top frequent tags associated with the KLS 6 nearest neighboring tweets of the given tweet in the NMF-computed Latent Space. Once the tags are completed, they are used to enrich the tweets before topic modeling. Of course, only the tweets' bag-of-words descriptions in a given window are used to compute the NMF for that window's topic modeling. The annotation generally resulted in a lower perplexity of the extracted topic models, as shown in Figure 6.

6 Results

perplexity score indicating better generalization performance. For a test set of T documents D′ = {w⃗^(1), ..., w⃗^(T)}, with N_d being the total number of keywords in the dth document, the perplexity given in Equation 4 will be lower for a better topic model. Figure 5 shows the perplexity trends, suggesting that more topics result in lower (thus better) perplexity. Also, irrespective of the number of topics, AD-LDA-based topic modeling can extract topics of good quality.

    perplexity(D′) = exp( − ( Σ_{d=1}^{T} ln p(w⃗^(d) | α, β) ) / ( Σ_{d=1}^{T} N_d ) )        (4)

Figure 5: Perplexity trends for each sliding window of width three, for various numbers of extracted topics.

6.2 Sentiment Based Topic Modeling

Table 2 shows a subset of topics extracted from the positive and negative sentiment groups of tweets; these tend to be more refined than the standard sentiment-agnostic topics. From the initial window, 1000 topics were extracted in the same way as with the Distributed LDA; however, the topic modeling was preceded by a sentiment classifier that classifies the tweets based on their sentiment (positive or negative). Although the positive and negative topics still share a few keywords, they are clearly divided by sentiment.

6.3 Topic Clustering in the Latent Space

Figure 3 shows the topic clusters created using the latent space-projected features extracted using NMF. The clusters in Figure 3 seem to have better quality compared to the clusters in Figure 2, because of the more accurate capture of the pairwise similarities between topics in the conceptual space. Figure 4 shows the clustering of the top 10 topics for a series of 6 windows, showing how the agglomeration can consolidate the topics discovered at different time slots, helping avoid excessive fragmentation throughout the stream's life.

6.4 Automated Hashtag Annotation
6.1     Distributed LDA-based Topic Modeling                      Tweet data is very sparse and not every tweet has valu-
                                                                  able tags. To overcome this weakness, we applied an
Figure 2 shows7 a sample of the topic clusters’ hi-
                                                                  NMF-based automated tweet annotation before topic
erarchy extracted from the initial window and with-
                                                                  modeling. Adding the predicted hashtags to the tweets
out NMF-based latent space projection of the top-
                                                                  enhanced the topic modeling. The automated tag an-
ics. The clusters are of debatable quality. Per-
                                                                  notation, described in Section 5.3, generally resulted
plexity is a common metric to evaluate language
                                                                  in lower Perplexity of the extracted topic models, as
models [BL06][BNJ03]. It is monotonically decreas-
                                                                  shown in Figure 6, suggesting that the auto-completed
ing in the likelihood of the test data, with a lower
                                                                  tags did help complete some missing and valuable in-
    6 we report results for K
                              LS = 5
                                                                  formation in the sparse tweet data, thus helping the
    7 Refer to the electronic version of the paper for clarity.   topic modeling.
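The projection-and-agglomeration procedure of Steps 3 and 4, whose output is evaluated in Section 6.3, can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes scikit-learn and SciPy, and a random non-negative matrix stands in for the real LDA topic-word distributions.

```python
import numpy as np
from sklearn.decomposition import NMF
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Stand-in topic-word matrix: 20 topics over a 500-term vocabulary.
# In the paper this would come from the Distributed LDA output.
topics = rng.random((20, 500))

# Step 2-3: project the topics into a lower-dimensional latent space with NMF,
# then take the normalized inner product of A with its transpose, which equals
# the pairwise cosine similarity between topic pairs.
nmf = NMF(n_components=5, init="nndsvda", random_state=0)
A = nmf.fit_transform(topics)                      # 20 x 5 latent topic vectors
A_unit = A / np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
similarity = A_unit @ A_unit.T                     # 20 x 20 cosine similarities

# Step 4: average-linkage hierarchical clustering on cosine *distance*.
distance = np.clip(1.0 - similarity, 0.0, None)    # guard against FP negatives
condensed = distance[np.triu_indices_from(distance, k=1)]  # condensed form
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=5, criterion="maxclust") # e.g. cut into 5 clusters
print(labels)
```

Average linkage is used here because the paper reports it as the best-performing strategy; swapping `method="single"` reproduces the alternative it was compared against.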
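The tag-completion step of Section 5.3, whose effect on perplexity is evaluated above, can likewise be sketched with a toy corpus. The tweets, hashtags, and K_LS = 3 below are illustrative only (the paper uses K_LS = 5 on real tweet windows), and scikit-learn's NMF and NearestNeighbors stand in for the actual pipeline components.

```python
from collections import Counter

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus: each tweet has a bag-of-words text and (possibly empty) hashtags.
tweets = [
    "ukraine protest kiev square crowd",
    "protest kiev police clash crowd",
    "new phone release camera battery",
    "phone camera review battery life",
    "ukraine kiev clash police protest",   # untagged tweet to annotate
]
hashtags = [["#ukraine", "#euromaidan"], ["#euromaidan"], ["#tech"], ["#tech"], []]

# Bag-of-words matrix for the window, then NMF projection into a latent space.
X = CountVectorizer().fit_transform(tweets)
Z = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(X)

# K_LS nearest neighbors in the latent space (3 here; the paper uses 5).
k_ls = 3
nn = NearestNeighbors(n_neighbors=k_ls).fit(Z)

def complete_tags(i, top_n=2):
    """Predict tags for tweet i from the most frequent tags of its neighbors."""
    _, idx = nn.kneighbors(Z[i : i + 1])
    counts = Counter(t for j in idx[0] if j != i for t in hashtags[j])
    return [t for t, _ in counts.most_common(top_n)]

predicted = complete_tags(4)   # annotate the untagged tweet
print(predicted)
```

The predicted tags would then be appended to the tweet's bag of words before the window's topic modeling, which is what produced the perplexity improvement shown in Figure 6.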
Figure 6: Perplexity for different numbers of topics and varying window length, showing improved results when NMF-based automated tweet annotation is performed before topic modeling.

7  Conclusion

Using Distributed LDA topic modeling, followed by NMF and hierarchical clustering within the resulting Latent Space (LS), helped organize the topics into less fragmented themes. Sentiment detection prior to topic modeling and automated hashtag annotation helped improve the learned topic models, while the agglomeration of topics across several time windows can link the topics discovered at different times. Our focus was on topic modeling and organization using the simplest (bag-of-words) features. Specialized Twitter feature extraction and selection methods, such as the ones surveyed and proposed by Aiello et al. [APM+13], have the potential to improve our results, a direction we will explore in the future. Another direction to explore is the news-domain-specific, user-centered approach discussed in [SNT+14], as well as a more expanded use of automated annotation to support topic extraction and description.

8  Acknowledgements

We would like to thank the organizers of the SNOW 2014 workshop, in particular the members of the SocialSensor team, for their leadership in all the phases of the competition.

References

[AN12]    Artur Abdullin and Olfa Nasraoui. Clustering heterogeneous data sets. In Web Congress (LA-WEB), 2012 Eighth Latin American, pages 1-8. IEEE, 2012.

[APM+13]  Luca Maria Aiello, Georgios Petkos, Carlos Martin, David Corney, Symeon Papadopoulos, Ryan Skraba, Ayse Goker, Ioannis Kompatsiaris, and Alejandro Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 2013.

[BL06]    David Blei and John Lafferty. Correlated topic models. Advances in Neural Information Processing Systems, 18:147, 2006.

[BM10]    David M. Blei and Jon D. McAuliffe. Supervised topic models. arXiv preprint arXiv:1003.0783, 2010.

[BNJ03]   David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.

[CBGN12]  Juan C. Caicedo, Jaafar BenAbdallah, Fabio A. González, and Olfa Nasraoui. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing, 76(1):50-60, 2012.

[HBB10]   Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856-864, 2010.

[HN12]    Basheer Hawwash and Olfa Nasraoui. Stream-Dashboard: a framework for mining, tracking and validating clusters in a data stream. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 109-117. ACM, 2012.

[LGD11]   Chang-Hwan Lee, Fernando Gutierrez, and Dejing Dou. Calculating feature weights in naive Bayes with Kullback-Leibler measure. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 1146-1151. IEEE, 2011.

[LH09]    Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375-384. ACM, 2009.

[LHAY07]  Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. ARSA: a sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 607-614. ACM, 2007.

[LS99]    Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.

[LZ08]    Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In Proceedings of the 17th International Conference on World Wide Web, pages 121-130. ACM, 2008.

[McC]     MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/mccallum/mallet.

[NASW09]  David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. The Journal of Machine Learning Research, 10:1801-1828, 2009.

[PCA14]   Symeon Papadopoulos, David Corney, and Luca Maria Aiello. SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.

[PT94]    Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111-126, 1994.

[SNT+14]  S. Schifferes, N. Newman, N. Thurman, D. Corney, A. S. Goker, and C. Martin. Identifying and verifying news through social media: Developing a user-centered tool for professional journalists. Digital Journalism, 2014.

[TM08]    Ivan Titov and Ryan McDonald. Modeling online reviews with multi-grain topic models. In Proceedings of the 17th International Conference on World Wide Web, pages 111-120. ACM, 2008.

[WWC05]   Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165-210, 2005.

[YMM09]   Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937-946. ACM, 2009.

[ZBG13]   Ke Zhai and Jordan Boyd-Graber. Online topic models with infinite vocabulary. In International Conference on Machine Learning, 2013.