<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gopi</forename><forename type="middle">Chand</forename><surname>Nutakki</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
							<email>olfa.nasraoui@louisville.edu</email>
							<affiliation key="aff1">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Behnoush</forename><surname>Abdollahi</surname></persName>
							<affiliation key="aff2">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mahsa</forename><surname>Badami</surname></persName>
							<email>m0bada01@louisville.edu</email>
							<affiliation key="aff3">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wenlong</forename><surname>Sun</surname></persName>
							<affiliation key="aff4">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9EB53C60EA608F5C6FA08DCCF1E9ECE2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe the methodology that we followed to automatically extract topics corresponding to known events provided by the SNOW 2014 challenge in the context of the SocialSensor project. A data crawling tool and selected filtering terms were provided to all the teams. The crawled data was to be divided into 96 (15-minute) timeslots spanning a 24-hour period, and participants were asked to produce a fixed number of topics for the selected timeslots. Our preliminary results are obtained using a methodology that pulls strengths from several machine learning techniques, including Latent Dirichlet Allocation (LDA) for topic modeling and Non-negative Matrix Factorization (NMF) for automated hashtag annotation and for mapping the topics into a latent space where they become less fragmented and can be better related with one another. In addition, we obtain improved topic quality when a sentiment detection step precedes topic modeling.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The SNOW 2014 challenge was organized within the context of the SocialSensor project<ref type="foot" target="#foot_0">1</ref>, which works on developing a new framework for enabling real-time multimedia indexing and search in the Social Web. The aim of the challenge was to automatically extract topics corresponding to known events that were prescribed by the challenge organizers. Also provided was a data crawling tool, along with several Twitter filter terms (syria, ukraine, bitcoin, terror). The crawled data was to be divided into a total of 96 (15-minute) timeslots spanning a 24-hour period, with the goal of extracting a fixed number of topics in each timeslot. Only tweets up to the end of the timeslot could be used to extract any topic. In this paper, we focus on the topic extraction task, rather than on input data filtering or the presentation of associated headlines, tweets, and image URLs, because this was one of the activities closest to the ongoing research [AN12, HN12, CBGN12] on multi-domain data stream clustering in the Knowledge Discovery &amp; Web Mining Lab at the University of Louisville. To extract topics from the tweets crawled in each time slot, we use a Latent Dirichlet Allocation (LDA) based technique. We then discover latent concepts using Non-negative Matrix Factorization (NMF) on the resulting topics, and apply hierarchical clustering within the resulting Latent Space (LS) in order to agglomerate these topics into less fragmented themes that can facilitate the visual inspection of how the different topics are inter-related. We have also experimented with adding a sentiment detection step prior to topic modeling in order to obtain a polarity-sensitive topic discovery, and with automated hashtag annotation to improve the topic extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Latent Dirichlet Allocation</head><p>Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic model for text documents. It assumes a collection of K topics, where each topic defines a multinomial over the vocabulary that is assumed to have been drawn from a Dirichlet distribution <ref type="bibr" target="#b5">[BNJ03]</ref> <ref type="bibr" target="#b7">[HBB10]</ref>. Given the topics, LDA assumes the generative process for each document d shown in Algorithm 1, where the notation is listed in Table <ref type="table" target="#tab_0">1</ref>. Equation 1 gives the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w, for parameters α and β.</p><formula xml:id="formula_0">p (θ, z, w|α, β) = p (θ|α) Π N n=1 p (z n |θ) p (w n |z n , β) (1)</formula><p>Integrating over θ and summing over z, we obtain the marginal distribution of a document <ref type="bibr" target="#b5">[BNJ03]</ref>:</p><formula xml:id="formula_1">p (w|α, β) = ∫ p (θ|α) Π N n=1 Σ zn p (z n |θ) p (w n |z n , β) dθ</formula><p>Taking the product of the marginal probabilities of single documents, the probability of a corpus D can be obtained:</p><formula>p (D|α, β) = Π M d=1 ∫ p (θ d |α) Π N d n=1 Σ z dn p (z dn |θ d ) p (w dn |z dn , β) dθ d</formula><p>The posterior is usually approximated using Markov Chain Monte Carlo (MCMC) methods or variational inference. Both methods are effective, but run into significant computational challenges on massive data sets. For this reason, we concentrated on a distributed version of LDA, which is summarized in the next section.</p></div>
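The generative process in Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the paper's code: the topic weights θ and the per-topic word distributions β are passed in as toy values instead of being drawn from their Dirichlet priors.

```python
import random

def generate_document(theta, beta, n_words, rng):
    """Generate one document following LDA's generative process.

    theta: topic weights for this document (assumed drawn from Dir(alpha)).
    beta:  list of per-topic word distributions over the vocabulary.
    """
    doc = []
    for _ in range(n_words):
        # Draw a topic index z from the document's topic weights.
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Draw the observed word w from the selected topic's distribution.
        w = rng.choices(range(len(beta[z])), weights=beta[z], k=1)[0]
        doc.append((z, w))
    return doc

rng = random.Random(0)
# Toy example: 2 topics over a 4-word vocabulary (values are illustrative).
theta = [0.9, 0.1]
beta = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7]]
doc = generate_document(theta, beta, 10, rng)
```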
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Distributed Algorithms for LDA</head><p>It is possible to distribute non-collapsed Gibbs sampling, because the sampling of z di can happen independently given θ d and φ k , and thus can be done concurrently. In a non-collapsed Gibbs sampler, one samples z di given θ d and φ k , and then θ d and φ k given z di . If individual documents are not spread across different processors, one can marginalize over just θ d , since θ d is processor-specific. In this partially collapsed scheme, the latent variables z di on each processor can be sampled concurrently, where the concurrency is over processors. The slow convergence of partially collapsed and non-collapsed Gibbs samplers (due to the strong dependencies between the parameters and latent variables) has led to devising distributed algorithms for fully collapsed Gibbs samplers [NASW09] <ref type="bibr" target="#b20">[YMM09]</ref>.</p><p>Given M documents and P processors, with approximately M p = M/P documents distributed to each processor p, the M documents are partitioned into</p><formula xml:id="formula_2">x = {x 1 , • • • , x p , • • • , x P } and z = {z 1 , • • • , z p , • • • , z P }</formula><p>with z being the corresponding topic assignments, where processor p stores x p , the words from documents j = (p − 1)M p + 1, • • • , pM p , and z p , the corresponding topic assignments. Topic-document counts N dk are likewise distributed as N dkp . The word-topic counts N wk are also distributed, with each processor p keeping a separate local copy N wkp .</p><formula xml:id="formula_3">Algorithm 2 Standard Collapsed Gibbs Sampling. LDAGibbsItr(x p , z p , N dkp , N wkp , α, β): 1. For Each document d ∈ {1, • • • , M p } 2. For Each distinct word i in document d 3. v ← x dpi , T dpi ← count of word v in document d 4. For Each j ∈ {1, • • • , T dpi } 5. k ← z dpij 6. N dkp ← N dkp − 1, N vkp ← N vkp − 1 7. For k = 1 to K 8. ρ k ← ρ k−1 + (N dkp + α) × (N vkp + β) / (Σ w N wkp + W β) 9. x ∼ Uniform(0, ρ K ) 10. k ← BinarySearch(k : ρ k−1 &lt; x &lt; ρ k ) 11. N dkp ← N dkp + 1, N vkp ← N vkp + 1 12. z dpij ← k</formula><p>In line 8, W denotes the vocabulary size. Although Gibbs sampling is a sequential process, given the typically large number of word tokens compared to the number of processors, the dependence of z ij on the update of any other topic assignment z i'j' is likely to be weak, thus relaxing the sequential sampling constraint. If two processors are concurrently sampling, but with different words in different documents, then concurrent sampling will approximate sequential sampling. This is because the only term affecting the order of the update operations is the total word-topic count Σ w N wk . Algorithm 3 shows the pseudocode of the AD-LDA algorithm <ref type="bibr" target="#b14">[NASW09]</ref>.</p><formula xml:id="formula_4">Algorithm 3 Approximate Distributed LDA. Input: A list of M documents, x = {x 1 , • • • , x p , • • • , x P } Output: z = {z 1 , • • • , z p , • • • , z P } 1. Repeat 2.</formula><p>For each processor p in parallel do</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.</head><p>Copy </p><formula xml:id="formula_5">N wk ← N wk + Σ p (N wkp − N wk )</formula><p>7. Until termination criterion is satisfied. The AD-LDA algorithm can terminate after a fixed number of iterations, or based on a suitable MCMC convergence metric. It samples from an approximation to the posterior distribution by allowing different processors to concurrently sample topic assignments on their local subsets of the data. AD-LDA works well empirically and accelerates the topic modeling process.</p></div>
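A minimal single-machine sketch of the collapsed Gibbs update at the heart of Algorithm 2 may clarify the count bookkeeping (AD-LDA runs this concurrently per processor on local counts). The toy corpus, hyper-parameters, and data structures below are illustrative, not the paper's implementation.

```python
import random

def gibbs_resample_token(z, d, i, w, N_dk, N_wk, N_k, alpha, beta, V, rng):
    """One collapsed Gibbs update for token i of document d (word id w).

    N_dk[d][k]: topic counts per document; N_wk[w][k]: topic counts per word;
    N_k[k]: total tokens assigned to topic k; V: vocabulary size.
    """
    K = len(N_k)
    old = z[d][i]
    # Remove the token's current assignment from all counts.
    N_dk[d][old] -= 1; N_wk[w][old] -= 1; N_k[old] -= 1
    # Unnormalized full conditional: (N_dk + a)(N_wk + b) / (N_k + V*b).
    weights = [(N_dk[d][k] + alpha) * (N_wk[w][k] + beta) / (N_k[k] + V * beta)
               for k in range(K)]
    new = rng.choices(range(K), weights=weights, k=1)[0]
    # Add the token back under its new assignment.
    N_dk[d][new] += 1; N_wk[w][new] += 1; N_k[new] += 1
    z[d][i] = new
    return new

rng = random.Random(1)
docs = [[0, 1, 2], [2, 3, 3]]        # toy corpus: word ids per document
K, V = 2, 4
z = [[rng.randrange(K) for _ in doc] for doc in docs]
N_dk = [[0] * K for _ in docs]
N_wk = [[0] * K for _ in range(V)]
N_k = [0] * K
for d, doc in enumerate(docs):       # initialize counts from assignments
    for i, w in enumerate(doc):
        k = z[d][i]
        N_dk[d][k] += 1; N_wk[w][k] += 1; N_k[k] += 1
for _ in range(20):                  # a few Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            gibbs_resample_token(z, d, i, w, N_dk, N_wk, N_k, 0.1, 0.01, V, rng)
```

The count invariants (totals preserved, no negative counts) are what make the concurrent AD-LDA approximation safe to reconcile at the synchronization step.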
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Topic Extraction Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data Preprocessing</head><p>The dataset consists of tweets that were acquired from the Twitter servers by continuous querying using a wrapper for the Twitter API over a period of 24 hours. The batches of tweets are acquired in raw JSON<ref type="foot" target="#foot_1">2</ref> format. Various properties of each tweet, such as the hashtags, URLs, creation time, counts of retweets and favorites, and other user information including the encoding and language, are extracted. The hashtags can provide a good source of discriminating features, and they were folded as terms into the bag-of-words model for each tweet where they were present (without the '#' prefix). The URLs can also later support topic summarization.</p></div>
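The field extraction and hashtag folding described above can be sketched as follows. The helper name and minimal payload are illustrative, though `text` and `entities.hashtags` are the standard fields of the Twitter JSON tweet object.

```python
import json

def tweet_to_bow(raw_json):
    """Extract a bag-of-words from a raw tweet JSON string, folding in
    hashtags as plain terms (without the '#' prefix)."""
    tweet = json.loads(raw_json)
    text = tweet.get("text", "").lower()
    # Keep only alphabetic tokens; drop URLs, mentions, and other noise.
    tokens = [t for t in text.split() if t.isalpha()]
    # Fold hashtags into the bag of words, '#' stripped.
    for tag in tweet.get("entities", {}).get("hashtags", []):
        tokens.append(tag["text"].lower())
    return tokens

# Hypothetical minimal tweet payload for illustration.
raw = json.dumps({"text": "Bitcoin price swings again",
                  "entities": {"hashtags": [{"text": "Bitcoin"}]},
                  "created_at": "Tue Feb 25 12:00:00 +0000 2014"})
bow = tweet_to_bow(raw)
```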
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topic Extraction Stages</head><p>The technique assumes a real-time streaming data input, which is replicated using process calls to the storage records containing the tweets. For AD-LDA, each tweet is considered a single document. Figure <ref type="figure" target="#fig_0">1</ref> shows the steps performed to extract the topics in each window or time slot. The procedure starts with the extraction of key information from the Twitter JSON; then the tweet text and other properties are used to extract topics. The topic extraction is performed using the following steps:</p><p>1. The documents are stripped of non-English characters and are converted to lowercase. The stop words are retained for context information (especially for sentiment detection).</p><p>2. Groups of documents are assembled into windows based on their timestamp. A sliding window's width is equal to three consecutive time slots ending in the current time slot.</p><p>3. The AD-LDA technique is run on a 20-node cluster. From each sliding window iteration, a total of 1000 topics is extracted; this higher value helps in extracting finer topics.</p><p>4. The topics are ranked based on the proportion of tweets assigned to the topic in the given window, and can then be clustered/merged together to organize them into more general topic groups.</p><p>The jsoup open source HTML parser 3 was used to extract multimedia content such as images and metadata from the URLs extracted from the tweets. The headlines are part of the metadata, while the keywords are obtained from the topic modeling itself, as the terms with the highest probability in the topic.</p></div>
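Steps 2 and 4 above (sliding-window assembly and topic ranking) can be sketched as follows; the slot contents and topic assignments are toy values, and the function names are ours.

```python
def sliding_window(tweets_by_slot, current_slot, width=3):
    """Collect tweets from `width` consecutive 15-minute slots ending at
    `current_slot` (only data up to the end of that slot is used)."""
    window = []
    for s in range(max(0, current_slot - width + 1), current_slot + 1):
        window.extend(tweets_by_slot.get(s, []))
    return window

def rank_topics(assignments):
    """Rank topic ids by the proportion of tweets assigned to each topic."""
    counts = {}
    for k in assignments:
        counts[k] = counts.get(k, 0) + 1
    total = len(assignments)
    return sorted(((k, c / total) for k, c in counts.items()),
                  key=lambda kv: kv[1], reverse=True)

# Toy slots and toy per-tweet topic assignments for illustration.
slots = {0: ["t0a", "t0b"], 1: ["t1a"], 2: ["t2a", "t2b", "t2c"]}
window = sliding_window(slots, 2)          # gathers slots 0..2
ranking = rank_topics([5, 5, 7, 5, 7, 9])  # topic 5 covers half the tweets
```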
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Topic Extraction with AD-LDA and Sentiment Labels</head><p>The AD-LDA technique with Gibbs sampling, along with automatically extracted sentiment labels, can also be used to extract polarity-sensitive topics. Using sentiment labels may improve the quality of the topics, as it results in finer topics (Figure <ref type="figure" target="#fig_0">1</ref>). A Naive Bayes classifier <ref type="bibr" target="#b9">[LGD11]</ref> trained with labeled tweet samples<ref type="foot" target="#foot_2">4</ref> and a set of labeled tokens<ref type="foot" target="#foot_3">5</ref> with known sentiment polarity can be used to extract the sentiment labels. The tweets are then regrouped based on the sentiment label, and the topic modeling is applied to each group, resulting in topics that are confined to one sentiment, as illustrated in Table <ref type="table">2</ref>.</p></div>
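The regroup-then-model scheme above can be sketched as follows, with stand-in classifier and topic-extractor functions (the paper uses a trained Bayes classifier and AD-LDA; the lambdas here are placeholders for illustration only).

```python
def polarity_sensitive_topics(tweets, classify, extract_topics):
    """Group tweets by predicted sentiment, then run topic modeling on each
    group so that every extracted topic is confined to one polarity."""
    groups = {}
    for tweet in tweets:
        groups.setdefault(classify(tweet), []).append(tweet)
    return {label: extract_topics(docs) for label, docs in groups.items()}

# Stand-in classifier and topic extractor, for illustration only.
classify = lambda t: "positive" if "great" in t else "negative"
extract_topics = lambda docs: sorted({w for d in docs for w in d.split()})

topics = polarity_sensitive_topics(
    ["great day in kiev", "terror in syria", "great news on bitcoin"],
    classify, extract_topics)
```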
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Topic Agglomeration in a Latent Space</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Discovering Latent Factors Among the Discovered Topics Using Non-negative Matrix Factorization (NMF)</head><p>Because the initial topic modeling generated a high number of topics (1000 per window) that were, furthermore, very sparse in terms of their descriptive terms, these topics were hard to interpret and could benefit from a coarser, less fragmented organization. One way to fix this problem was to merge the topics based on conceptual similarity by applying Non-negative Matrix Factorization (NMF) <ref type="bibr" target="#b12">[LS99]</ref>.</p><p>Because the topics-by-words matrix is very sparse, we used NMF to project the topics onto a common lower-dimensional latent factor space. NMF takes as input the matrix X of n topics by m words (as binary features) and decomposes it into two factor matrices (A and B) which represent the topics and words, respectively, in a k f -dimensional latent space, as follows:</p><formula xml:id="formula_6">X n×m ≈ A n×k f B T m×k f (2)</formula><p>Positive Sentiment</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Negative Sentiment optimistic ukraine antiwar nonintervention horrible building badge hiding ukraine yanukovych syria refugees about education children million syria yarmouk camp crisis food waiting unrest shocking future technology bitcoins value law accelerating cnn protocols loss gox bitcoin fault</head><p>Table <ref type="table">2</ref>: Illustrating a sample of the finer topics extracted after a preliminary sentiment detection phase.</p><p>In Equation 2, k f is the approximate rank of matrices A and B, selected such that k f &lt; min(m, n), so that the number of elements in the decomposition matrices is far less than the number of elements of the original matrix:</p><formula xml:id="formula_7">nk f + k f m ≪ nm</formula><p>The topic factor matrix (A) can then be used to find the similarity between the topics in the new latent space instead of the original term space. The similarity matrix obtained from the NMF factors can finally be used to cluster the topics.</p><p>To find A and B, the Frobenius norm of the error between the data and the approximation is minimized, as follows:</p><formula xml:id="formula_8">J N M F = ||E|| 2 F = ||X − AB T || 2 F (3)</formula><p>Several algorithms have been proposed in the literature to minimize this cost. We used an Alternating Least Squares (ALS) method <ref type="bibr" target="#b16">[PT94]</ref> that iteratively solves for the factors, by assuming that the problem is convex in either one of the factor matrices alone. In the following, we summarize the steps that are applied after the discovery of the topics, in order to generate a hierarchical organization from the sparse topics.</p><p>1. Preprocessing of the topic vectors: For each window, the topic-word matrix (X n×m ) is extracted from the final topic modeling results. The features are the top words in a topic, and they are binary (1 if a topic has the word in question and 0 otherwise).</p><p>2. 
Latent Factor Discovery using NMF: The topic-word data was normalized before running NMF. The latter produces two factors (A and B), where n, k f and m are the number of topics, latent factors and words, respectively. Our main goal was to compute the matrix A, also called the topic basis factor, which maps the topics into the latent space. Choosing the number of factors, k f , has an impact on the results. After trial and error, we chose k f = 30.</p><p>3. Generating the topic-similarity matrix in the latent space: The computed topic basis matrix (A) was used to obtain the similarity in lieu of the original topic vectors. The normalized inner product of the matrix A and its transpose was calculated for this purpose. Normalization of the product is equivalent to computing the cosine similarity between topic pairs. The resulting matrix contains the pairwise similarity between all pairs of topics within the latent space.</p></div>
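Steps 2 and 3 can be sketched with a basic ALS NMF (in the spirit of Algorithm 4) followed by cosine similarity between the rows of A. The toy topic-word matrix, iteration count, and random seed below are illustrative; this is a sketch, not the exact implementation.

```python
import numpy as np

def nmf_als(X, k_f, n_iter=200, seed=0):
    """Basic ALS NMF: X (n x m) ~= A (n x k_f) @ B.T, projecting negative
    entries to zero after each least-squares solve (Algorithm 4 style)."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], k_f))
    for _ in range(n_iter):
        # Solve the least-squares problem for B given A, then project.
        Bt = np.linalg.lstsq(A, X, rcond=None)[0]       # k_f x m
        Bt = np.clip(Bt, 0, None)
        # Solve for A given B, then project onto nonnegatives.
        At = np.linalg.lstsq(Bt.T, X.T, rcond=None)[0]  # k_f x n
        A = np.clip(At.T, 0, None)
    return A, Bt.T

def topic_similarity(A):
    """Cosine similarity between topics in the latent space (rows of A)."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    U = A / np.maximum(norms, 1e-12)
    return U @ U.T

# Toy binary topic-word matrix: 4 topics over 5 words, two clear themes.
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], float)
A, B = nmf_als(X, k_f=2)
S = topic_similarity(A)   # topics 0,1 should be closer than topics 0,2
```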
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Hierarchical Clustering of the latent space-projected topics based on the new pairwise similarity scores computed in Step 3</head><p>We experimented with several linkage strategies, such as single and average linkage. The latter was chosen as optimal.</p></div>
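A naive average-linkage agglomeration over the latent-space similarities might look like the following; the similarity values are illustrative toy numbers (in practice one would run a library linkage routine on the Step 3 matrix).

```python
def average_linkage(dist, n_clusters):
    """Naive average-linkage agglomerative clustering on a precomputed
    distance matrix, where dist[i][j] = 1 - similarity (as in the paper)."""
    clusters = [[i] for i in range(len(dist))]
    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters.pop(q)
    return clusters

# Toy similarities: topics 0,1 are close; 2,3 are close; groups are far apart.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
dist = [[1 - s for s in row] for row in sim]
clusters = average_linkage(dist, 2)
```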
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Automated Hashtag Annotation</head><p>We have also experimented with a simple tag completion or prediction step prior to topic modeling. The annotation for a given tweet is determined by finding the top frequent tags associated with the K LS nearest neighboring tweets in the NMF-computed Latent Space (we report results for K LS = 5). Once the tags are completed, they are used to enrich the tweets before topic modeling. Of course, only the tweets' bag-of-words descriptions in a given window are used to compute the NMF for that window's topic modeling. The annotation generally resulted in a lower Perplexity of the extracted topic models, as shown in Figure <ref type="figure" target="#fig_8">6</ref>.</p></div>
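The neighbor-based tag completion can be sketched as follows; the latent vectors and hashtags are toy values, and `annotate` is our illustrative name for the step.

```python
from collections import Counter

def annotate(tweet_vec, latent_vecs, tags, k_ls=5, top_n=2):
    """Predict hashtags for a tweet as the most frequent tags among its
    k_ls nearest neighbors in the latent space (cosine similarity)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    # Rank neighboring tweets by similarity in the NMF latent space.
    order = sorted(range(len(latent_vecs)),
                   key=lambda i: cos(tweet_vec, latent_vecs[i]), reverse=True)
    counts = Counter(t for i in order[:k_ls] for t in tags[i])
    return [tag for tag, _ in counts.most_common(top_n)]

# Toy latent vectors and observed hashtags (illustrative values).
latent = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
tags = [["ukraine"], ["ukraine", "kiev"], ["bitcoin"], ["bitcoin", "gox"]]
predicted = annotate([1.0, 0.0], latent, tags, k_ls=2)
```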
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Distributed LDA-based Topic Modeling</head><p>Figure <ref type="figure" target="#fig_2">2</ref> shows a sample of the topic clusters' hierarchy extracted from the initial window, without NMF-based latent space projection of the topics. The clusters are of debatable quality. Perplexity is a common metric to evaluate language models [BL06] <ref type="bibr" target="#b5">[BNJ03]</ref>. It is monotonically decreasing in the likelihood of the test data, with a lower perplexity score indicating better generalization performance. For a test set of T documents, with N d being the total number of keywords in the d-th document, the perplexity, given in Equation 4, will be lower for a better topic model. Figure <ref type="figure" target="#fig_6">5</ref> shows the perplexity trends, suggesting that more topics result in lower (thus better) perplexity. Also, irrespective of the number of topics, AD-LDA-based topic modeling can extract topics of good quality.</p><formula xml:id="formula_9">perplexity (D test ) = exp ( − Σ T d=1 ln p (w d |α, β) / Σ T d=1 N d ) (4)</formula></div>
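Equation 4 reduces to a one-liner once per-document log-likelihoods are available; the numeric values below are arbitrary, and the helper name is ours. A sanity check: a model that assigns uniform probability 1/4 to every word gives a perplexity of exactly 4.

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity of a held-out set (Equation 4): the exponential of the
    negative total log-likelihood divided by the total number of keywords."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

# Toy example: per-document values of ln p(w_d | alpha, beta) and lengths.
ppl = perplexity([-10.0, -12.0], [8, 10])
# Uniform-model sanity check: 5 words each with probability 1/4.
uniform_ppl = perplexity([5 * math.log(0.25)], [5])
```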
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Sentiment Based Topic Modeling</head><p>Table <ref type="table">2</ref> shows a subset of topics extracted from the positive and negative sentiment groups of tweets; these tend to be more refined than the standard sentiment-agnostic topics. From the initial window, 1000 topics were extracted in the same way as with Distributed LDA; however, topic modeling was preceded by a sentiment classifier that labels the tweets as positive or negative. Although positive and negative topics still share a few keywords, they are clearly divided by sentiment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Topic Clustering in the Latent Space</head><p>Figure <ref type="figure" target="#fig_3">3</ref> shows the topic clusters created using the latent space-projected features extracted using NMF. The clusters in Figure <ref type="figure" target="#fig_3">3</ref> seem to have better quality compared to the clusters in Figure <ref type="figure" target="#fig_2">2</ref> because of the more accurate capture of pairwise similarities between topics in the conceptual space. Figure <ref type="figure" target="#fig_4">4</ref> shows the clustering of the top 10 topics for a series of 6 windows, showing how the agglomeration can consolidate the topics discovered at different time slots, helping avoid excessive fragmentation throughout the stream's life.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Automated Hashtag Annotation</head><p>Tweet data is very sparse, and not every tweet has valuable tags. To overcome this weakness, we applied an NMF-based automated tweet annotation before topic modeling. Adding the predicted hashtags to the tweets enhanced the topic modeling. The automated tag annotation, described in Section 5.3, generally resulted in a lower Perplexity of the extracted topic models, as shown in Figure <ref type="figure" target="#fig_8">6</ref>, suggesting that the auto-completed tags filled in some missing and valuable information in the sparse tweet data, thus aiding the topic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>Using Distributed LDA topic modeling, followed by NMF and hierarchical clustering within the resulting Latent Space (LS), helped organize the topics into less fragmented themes. Sentiment detection prior to topic modeling and automated hashtag annotation helped improve the learned topic models, while the agglomeration of topics across several time windows can link the topics discovered at different time windows. Our focus was on topic modeling and organization using the simplest (bag of words) features. Specialized Twitter feature extraction and selection methods, such as the ones surveyed and proposed by Aiello et al. [APM + 13], have the potential to improve our results, a direction we will explore in the future. Another direction to explore is the news-domain-specific, user-centered approach discussed in [SNT + 14], and a more expanded use of automated annotation to support topic extraction and description.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Topic Modeling Framework (sentiment detection and hashtag annotation are not shown).</figDesc><graphic coords="2,64.80,54.07,244.79,140.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>x di : i-th observed word in document d; z di : topic assigned to x di ; N wk : count of word w assigned to topic k; N dk : count of topic k assigned in document d; φ k : probability of word given topic k; θ d : probability of topic given document d; α, β: Dirichlet priors. Algorithm 1 Latent Dirichlet Allocation. Input: A document collection, hyper-parameters α and β. Output: A list of topics. 1. Draw a distribution over topics, θ d ∼ Dir(α). 2. For Each word i in the document: 3. Draw a topic index z di ∈ {1, • • • , K} from the topic weights, z di ∼ θ d . 4. Draw the observed word w di from the selected topic, w di ∼ β z di . p (D|α, β) = Π M d=1 ∫ p (θ d |α) Π N d n=1 Σ z dn p (z dn |θ d ) p (w dn |z dn , β) dθ d</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Dendrogram depicting a few clusters' hierarchy of topics from the initial window. Agglomeration is based on the cosine similarity. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as (1 − similarity). Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="4,73.80,54.07,216.00,57.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Portion of dendrogram depicting the clusters' hierarchy of topics from the initial window (Number 0). Agglomeration is based on the dot product between the topics' projections on a lower dimensional latent space extracted using NMF with k f = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as 1 minus similarity. Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="4,325.80,54.07,216.00,139.86" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Portion of dendrogram depicting the clusters' hierarchy of topics from the first 6 windows. Agglomeration is based on the dot product between the topics' projections on a lower dimensional latent space extracted using NMF with k f = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as 1 minus similarity. Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="5,73.80,125.63,216.00,147.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Algorithm 4</head><label>4</label><figDesc>Basic Alternating Least Squares (ALS) Algorithm for NMF. Input: Data matrix X, number of factors k f . Output: optimal matrices A and B. 1. Initialize matrix A (for example, randomly) 2. Repeat (a) Solve for B in the equation: A T AB = A T X (b) Project the solution onto the non-negative matrix subspace: set all negative values in B to zero (c) Solve for A in the equation: BB T A T = BX T (d) Project the solution onto the non-negative matrix subspace: set all negative values in A to zero 3. Until the decrease in the cost function falls below a threshold</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Perplexity trends for each sliding window of width three for various numbers of extracted topics.</figDesc><graphic coords="6,73.80,54.07,216.00,93.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Perplexity for different numbers of topics and varying window length, showing improved results when NMF-based automated tweet annotation is performed before topic modeling.</figDesc><graphic coords="7,73.80,54.07,215.99,82.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Description of used variables.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>global counts: N wkp ← N wk</figDesc><table><row><cell>4.</cell><cell>Sample z p locally:</cell></row><row><cell></cell><cell>LDAGibbsItr(x p ,z p ,N dkp ,N wkp ,α,β)</cell><cell>//</cell></row><row><cell></cell><cell>Alg: 2</cell></row><row><cell cols="2">5. Synchronize</cell></row><row><cell cols="2">6. Update global counts:</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">SocialSensor: http://www.socialsensor.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">JSON: JavaScript Object Notation, is a text-based open standard designed for human-readable data interchange</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://github.com/ravikiranj/twitter-sentiment-analyzer</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/linron84/JST</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Acknowledgements</head><p>We would like to thank the organizers of the SNOW 2014 workshop, in particular the members of the SocialSensor team, for their leadership in all the phases of the competition.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Clustering heterogeneous data sets</title>
		<author>
			<persName><forename type="first">Artur</forename><surname>Abdullin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web Congress (LA-WEB)</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m">Eighth Latin American</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Sensing trending topics in twitter</title>
		<author>
			<persName><forename type="first">Luca</forename><forename type="middle">Maria</forename><surname>Aiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georgios</forename><surname>Petkos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlos</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Symeon</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Skraba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ayse</forename><surname>Goker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ioannis</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alejandro</forename><surname>Jaimes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Correlated topic models</title>
		<author>
			<persName><forename type="first">David</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page">147</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jon</forename><forename type="middle">D</forename><surname>McAuliffe</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1003.0783</idno>
		<title level="m">Supervised topic models</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">Juan</forename><forename type="middle">C</forename><surname>Caicedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jaafar</forename><surname>BenAbdallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><forename type="middle">A</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="50" to="60" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Online learning for latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">Matthew</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francis</forename><surname>Bach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="856" to="864" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream</title>
		<author>
			<persName><forename type="first">Basheer</forename><surname>Hawwash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications</title>
				<meeting>the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="109" to="117" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Calculating feature weights in naive Bayes with Kullback-Leibler measure</title>
		<author>
			<persName><forename type="first">Chang-Hwan</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Gutierrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dejing</forename><surname>Dou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Mining (ICDM), 2011 IEEE 11th International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1146" to="1151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Joint sentiment/topic model for sentiment analysis</title>
		<author>
			<persName><forename type="first">Chenghua</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yulan</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th ACM conference on Information and knowledge management</title>
				<meeting>the 18th ACM conference on Information and knowledge management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="375" to="384" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Arsa: a sentiment-aware model for predicting sales performance using blogs</title>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangji</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aijun</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaohui</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 30th annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="607" to="614" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Learning the parts of objects by non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">Sebastian</forename><surname>Seung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">401</biblScope>
			<biblScope unit="issue">6755</biblScope>
			<biblScope unit="page" from="788" to="791" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Opinion integration through semi-supervised topic modeling</title>
		<author>
			<persName><forename type="first">Yue</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chengxiang</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th international conference on World wide web</title>
				<meeting>the 17th international conference on World wide web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="121" to="130" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">MALLET: A machine learning for language toolkit</title>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">Kachites</forename><surname>McCallum</surname></persName>
		</author>
		<ptr target="http://www.cs.umass.edu/mccallum/mallet" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Distributed algorithms for topic models</title>
		<author>
			<persName><forename type="first">David</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arthur</forename><surname>Asuncion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Padhraic</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Max</forename><surname>Welling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1801" to="1828" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media</title>
		<author>
			<persName><forename type="first">Symeon</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luca Maria</forename><surname>Aiello</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SNOW 2014 Data Challenge</title>
				<meeting>the SNOW 2014 Data Challenge</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</title>
		<author>
			<persName><forename type="first">Pentti</forename><surname>Paatero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Unto</forename><surname>Tapper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Environmetrics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="111" to="126" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Identifying and verifying news through social media: Developing a user-centered tool for professional journalists</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schifferes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Thurman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Goker</surname></persName>
		</author>
		<author>
			<persName><surname>Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Journalism</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Modeling online reviews with multi-grain topic models</title>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Mcdonald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th international conference on World Wide Web</title>
				<meeting>the 17th international conference on World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="111" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Annotating expressions of opinions and emotions in language</title>
		<author>
			<persName><forename type="first">Janyce</forename><surname>Wiebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Theresa</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire</forename><surname>Cardie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language resources and evaluation</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="165" to="210" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Efficient methods for topic model inference on streaming document collections</title>
		<author>
			<persName><forename type="first">Limin</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Mimno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="937" to="946" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Online topic models with infinite vocabulary</title>
		<author>
			<persName><forename type="first">Ke</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordan</forename><surname>Boyd-Graber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
