<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gopi</forename><forename type="middle">Chand</forename><surname>Nutakki</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
							<email>olfa.nasraoui@louisville.edu</email>
							<affiliation key="aff1">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Behnoush</forename><surname>Abdollahi</surname></persName>
							<affiliation key="aff2">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mahsa</forename><surname>Badami</surname></persName>
							<email>m0bada01@louisville.edu</email>
							<affiliation key="aff3">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wenlong</forename><surname>Sun</surname></persName>
							<affiliation key="aff4">
								<orgName type="laboratory">Knowledge Discovery &amp; Web Mining Lab</orgName>
								<orgName type="institution">University of Louisville</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9EB53C60EA608F5C6FA08DCCF1E9ECE2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe the methodology that we followed to automatically extract topics corresponding to known events provided by the SNOW 2014 challenge in the context of the SocialSensor project. A data crawling tool and selected filtering terms were provided to all the teams. The crawled data was to be divided into 96 (15-minute) timeslots spanning a 24-hour period, and participants were asked to produce a fixed number of topics for the selected timeslots. Our preliminary results are obtained using a methodology that pulls strengths from several machine learning techniques, including Latent Dirichlet Allocation (LDA) for topic modeling and Non-negative Matrix Factorization (NMF) for automated hashtag annotation and for mapping the topics into a latent space where they become less fragmented and can be better related with one another. In addition, we obtain improved topic quality when a sentiment detection step precedes topic modeling.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The SNOW 2014 challenge was organized within the context of the SocialSensor project<ref type="foot" target="#foot_0">1</ref>, which works on developing a new framework for enabling real-time multimedia indexing and search in the Social Web. The aim of the challenge was to automatically extract topics corresponding to known events that were prescribed by the challenge organizers. Also provided was a data crawling tool, along with several Twitter filter terms (syria, ukraine, bitcoin, terror). The crawled data was to be divided into a total of 96 (15-minute) timeslots spanning a 24-hour period, with the goal of extracting a fixed number of topics in each timeslot. Only tweets up to the end of the timeslot could be used to extract any topic. In this paper, we focus on the topic extraction task, rather than on input data filtering or the presentation of associated headlines, tweets, and image URLs, because this was one of the activities closest to the ongoing research [AN12, HN12, CBGN12] on multi-domain data stream clustering in the Knowledge Discovery &amp; Web Mining Lab at the University of Louisville. To extract topics from the tweets crawled in each time slot, we use a Latent Dirichlet Allocation (LDA) based technique. We then discover latent concepts using Non-negative Matrix Factorization (NMF) on the resulting topics, and apply hierarchical clustering within the resulting Latent Space (LS) in order to agglomerate these topics into less fragmented themes that can facilitate the visual inspection of how the different topics are inter-related. We have also experimented with adding a sentiment detection step prior to topic modeling in order to obtain a polarity-sensitive topic discovery, and with automated hashtag annotation to improve the topic extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Latent Dirichlet Allocation</head><p>Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic model for text documents. It assumes a collection of K topics, where each topic defines a multinomial over the vocabulary that is assumed to have been drawn from a Dirichlet distribution <ref type="bibr" target="#b5">[BNJ03]</ref> <ref type="bibr" target="#b7">[HBB10]</ref>. Given the topics, LDA assumes the generative process for each document d shown in Algorithm 1, where the notation is listed in Table <ref type="table" target="#tab_0">1</ref>. Equation 1 gives the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w, for parameters α and β.</p><formula xml:id="formula_0">p (θ, z, w|α, β) = p (θ|α) Π N n=1 p (z n |θ) p (w n |z n , β) (1)</formula><p>Integrating over θ and summing over z, we obtain the marginal distribution of a document <ref type="bibr" target="#b5">[BNJ03]</ref>:</p><formula xml:id="formula_1">p (w|α, β) = ∫ p (θ|α) Π N n=1 Σ zn p (z n |θ) p (w n |z n , β) dθ</formula><p>Taking the product of the marginal probabilities of single documents, the probability of a corpus D can be obtained:</p><formula>p (D|α, β) = Π M d=1 ∫ p (θ d |α) Π N d n=1 Σ z dn p (z dn |θ d ) p (w dn |z dn , β) dθ d</formula><p>The posterior is usually approximated using Markov Chain Monte Carlo (MCMC) methods or variational inference. Both methods are effective, but run into significant computational challenges on massive data sets. For this reason, we concentrated on a distributed version of LDA, which is summarized in the next section.</p></div>
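The generative process in Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the paper's code: the topic weights θ and the per-topic word distributions β are passed in as toy values instead of being drawn from their Dirichlet priors.

```python
import random

def generate_document(theta, beta, n_words, rng):
    """Generate one document following LDA's generative process.

    theta: topic weights for this document (assumed drawn from Dir(alpha)).
    beta:  list of per-topic word distributions over the vocabulary.
    """
    doc = []
    for _ in range(n_words):
        # Draw a topic index z from the document's topic weights.
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Draw the observed word w from the selected topic's distribution.
        w = rng.choices(range(len(beta[z])), weights=beta[z], k=1)[0]
        doc.append((z, w))
    return doc

rng = random.Random(0)
# Toy example: 2 topics over a 4-word vocabulary (values are illustrative).
theta = [0.9, 0.1]
beta = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7]]
doc = generate_document(theta, beta, 10, rng)
```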
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Distributed Algorithms for LDA</head><p>It is possible to distribute non-collapsed Gibbs sampling, because the sampling of z di can happen independently given θ d and φ k , and thus can be done concurrently. In a non-collapsed Gibbs sampler, one samples z di given θ d and φ k , and then θ d and φ k given z di . If individual documents are not spread across different processors, one can marginalize over just θ d , since θ d is processor-specific. In this partially collapsed scheme, the latent variables z di on each processor can be sampled concurrently, where the concurrency is over processors. The slow convergence of partially collapsed and non-collapsed Gibbs samplers (due to the strong dependencies between the parameters and latent variables) has led to devising distributed algorithms for fully collapsed Gibbs samplers [NASW09] <ref type="bibr" target="#b20">[YMM09]</ref>.</p><p>Given M documents and P processors, with approximately M p = M/P documents distributed to each processor p, the M documents are partitioned into</p><formula xml:id="formula_2">x = {x 1 , • • • , x p , • • • , x P } and z = {z 1 , • • • , z p , • • • , z P }</formula><p>with z being the corresponding topic assignments, where processor p stores x p , the words from documents j = (p − 1)M p + 1, • • • , pM p , and z p , the corresponding topic assignments. Topic-document counts N dk are likewise distributed as N dkp . The word-topic counts N wk are also distributed, with each processor p keeping a separate local copy N wkp .</p><formula xml:id="formula_3">Algorithm 2 Standard Collapsed Gibbs Sampling. LDAGibbsItr(x p , z p , N dkp , N wkp , α, β): 1. For Each document d ∈ {1, • • • , M p } 2. For Each distinct word i in document d 3. v ← x dpi , T dpi ← count of word v in document d 4. For Each j ∈ {1, • • • , T dpi } 5. k ← z dpij 6. N dkp ← N dkp − 1, N vkp ← N vkp − 1 7. For k = 1 to K 8. ρ k ← ρ k−1 + (N dkp + α) × (N vkp + β) / (Σ w N wkp + W β) 9. x ∼ Uniform(0, ρ K ) 10. k ← BinarySearch(k : ρ k−1 &lt; x &lt; ρ k ) 11. N dkp ← N dkp + 1, N vkp ← N vkp + 1 12. z dpij ← k</formula><p>In line 8, W denotes the vocabulary size. Although Gibbs sampling is a sequential process, given the typically large number of word tokens compared to the number of processors, the dependence of z ij on the update of any other topic assignment z i'j' is likely to be weak, thus relaxing the sequential sampling constraint. If two processors are concurrently sampling, but with different words in different documents, then concurrent sampling will approximate sequential sampling. This is because the only term affecting the order of the update operations is the total word-topic count Σ w N wk . Algorithm 3 shows the pseudocode of the AD-LDA algorithm <ref type="bibr" target="#b14">[NASW09]</ref>.</p><formula xml:id="formula_4">Algorithm 3 Approximate Distributed LDA. Input: A list of M documents, x = {x 1 , • • • , x p , • • • , x P } Output: z = {z 1 , • • • , z p , • • • , z P } 1. Repeat 2.</formula><p>For each processor p in parallel do</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.</head><p>Copy </p><formula xml:id="formula_5">N wk ← N wk + Σ p (N wkp − N wk )</formula><p>7. Until termination criterion is satisfied. The AD-LDA algorithm can terminate after a fixed number of iterations, or based on a suitable MCMC convergence metric. It samples from an approximation to the posterior distribution by allowing different processors to concurrently sample topic assignments on their local subsets of the data. AD-LDA works well empirically and accelerates the topic modeling process.</p></div>
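A minimal single-machine sketch of the collapsed Gibbs update at the heart of Algorithm 2 may clarify the count bookkeeping (AD-LDA runs this concurrently per processor on local counts). The toy corpus, hyper-parameters, and data structures below are illustrative, not the paper's implementation.

```python
import random

def gibbs_resample_token(z, d, i, w, N_dk, N_wk, N_k, alpha, beta, V, rng):
    """One collapsed Gibbs update for token i of document d (word id w).

    N_dk[d][k]: topic counts per document; N_wk[w][k]: topic counts per word;
    N_k[k]: total tokens assigned to topic k; V: vocabulary size.
    """
    K = len(N_k)
    old = z[d][i]
    # Remove the token's current assignment from all counts.
    N_dk[d][old] -= 1; N_wk[w][old] -= 1; N_k[old] -= 1
    # Unnormalized full conditional: (N_dk + a)(N_wk + b) / (N_k + V*b).
    weights = [(N_dk[d][k] + alpha) * (N_wk[w][k] + beta) / (N_k[k] + V * beta)
               for k in range(K)]
    new = rng.choices(range(K), weights=weights, k=1)[0]
    # Add the token back under its new assignment.
    N_dk[d][new] += 1; N_wk[w][new] += 1; N_k[new] += 1
    z[d][i] = new
    return new

rng = random.Random(1)
docs = [[0, 1, 2], [2, 3, 3]]        # toy corpus: word ids per document
K, V = 2, 4
z = [[rng.randrange(K) for _ in doc] for doc in docs]
N_dk = [[0] * K for _ in docs]
N_wk = [[0] * K for _ in range(V)]
N_k = [0] * K
for d, doc in enumerate(docs):       # initialize counts from assignments
    for i, w in enumerate(doc):
        k = z[d][i]
        N_dk[d][k] += 1; N_wk[w][k] += 1; N_k[k] += 1
for _ in range(20):                  # a few Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            gibbs_resample_token(z, d, i, w, N_dk, N_wk, N_k, 0.1, 0.01, V, rng)
```

The count invariants (totals preserved, no negative counts) are what make the concurrent AD-LDA approximation safe to reconcile at the synchronization step.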
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Topic Extraction Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data Preprocessing</head><p>The dataset consists of tweets that were acquired from the Twitter servers by continuous querying using a wrapper for the Twitter API over a period of 24 hours. The batches of tweets are acquired in raw JSON<ref type="foot" target="#foot_1">2</ref> format. Various properties of each tweet, such as the hashtags, URLs, creation time, counts of retweets and favorites, and other user information including the encoding and language, are extracted. The hashtags can provide a good source of discriminating features, and they were folded as terms into the bag-of-words model for each tweet where they were present (without the '#' prefix). The URLs can also later support topic summarization.</p></div>
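The field extraction and hashtag folding described above can be sketched as follows. The helper name and minimal payload are illustrative, though `text` and `entities.hashtags` are the standard fields of the Twitter JSON tweet object.

```python
import json

def tweet_to_bow(raw_json):
    """Extract a bag-of-words from a raw tweet JSON string, folding in
    hashtags as plain terms (without the '#' prefix)."""
    tweet = json.loads(raw_json)
    text = tweet.get("text", "").lower()
    # Keep only alphabetic tokens; drop URLs, mentions, and other noise.
    tokens = [t for t in text.split() if t.isalpha()]
    # Fold hashtags into the bag of words, '#' stripped.
    for tag in tweet.get("entities", {}).get("hashtags", []):
        tokens.append(tag["text"].lower())
    return tokens

# Hypothetical minimal tweet payload for illustration.
raw = json.dumps({"text": "Bitcoin price swings again",
                  "entities": {"hashtags": [{"text": "Bitcoin"}]},
                  "created_at": "Tue Feb 25 12:00:00 +0000 2014"})
bow = tweet_to_bow(raw)
```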
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topic Extraction Stages</head><p>The technique assumes a real-time streaming data input, which is replicated using process calls to the storage records containing the tweets. For AD-LDA, each tweet is considered a single document. Figure <ref type="figure" target="#fig_0">1</ref> shows the steps performed to extract the topics in each window or time slot. The procedure starts with the extraction of key information from the Twitter JSON; then the tweet text and other properties are used to extract topics. The topic extraction is performed using the following steps:</p><p>1. The documents are stripped of non-English characters and are converted to lowercase. The stop words are retained for context information (especially for sentiment detection).</p><p>2. Groups of documents are assembled into windows based on their timestamp. A sliding window's width is equal to three consecutive time slots ending in the current time slot.</p><p>3. The AD-LDA technique is run on a 20-node cluster. From each sliding window iteration, a total of 1000 topics is extracted; this higher value helps in extracting finer topics.</p><p>4. The topics are ranked based on the proportion of tweets assigned to the topic in the given window, and can then be clustered/merged together to organize them into more general topic groups.</p><p>The jsoup open source HTML parser 3 was used to extract multimedia content such as images and metadata from the URLs extracted from the tweets. The headlines are part of the metadata, while the keywords are obtained from the topic modeling itself, as the terms with the highest probability in the topic.</p></div>
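Steps 2 and 4 above (sliding-window assembly and topic ranking) can be sketched as follows; the slot contents and topic assignments are toy values, and the function names are ours.

```python
def sliding_window(tweets_by_slot, current_slot, width=3):
    """Collect tweets from `width` consecutive 15-minute slots ending at
    `current_slot` (only data up to the end of that slot is used)."""
    window = []
    for s in range(max(0, current_slot - width + 1), current_slot + 1):
        window.extend(tweets_by_slot.get(s, []))
    return window

def rank_topics(assignments):
    """Rank topic ids by the proportion of tweets assigned to each topic."""
    counts = {}
    for k in assignments:
        counts[k] = counts.get(k, 0) + 1
    total = len(assignments)
    return sorted(((k, c / total) for k, c in counts.items()),
                  key=lambda kv: kv[1], reverse=True)

# Toy slots and toy per-tweet topic assignments for illustration.
slots = {0: ["t0a", "t0b"], 1: ["t1a"], 2: ["t2a", "t2b", "t2c"]}
window = sliding_window(slots, 2)          # gathers slots 0..2
ranking = rank_topics([5, 5, 7, 5, 7, 9])  # topic 5 covers half the tweets
```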
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Topic Extraction with AD-LDA and Sentiment Labels</head><p>The AD-LDA technique with Gibbs sampling, along with automatically extracted sentiment labels, can also be used to extract polarity-sensitive topics. Using sentiment labels may improve the quality of the topics, as it results in finer topics (Figure <ref type="figure" target="#fig_0">1</ref>). A Naive Bayes classifier <ref type="bibr" target="#b9">[LGD11]</ref> trained with labeled tweet samples<ref type="foot" target="#foot_2">4</ref> and a set of labeled tokens<ref type="foot" target="#foot_3">5</ref> with known sentiment polarity can be used to extract the sentiment labels. The tweets are then regrouped based on the sentiment label, and the topic modeling is applied to each group, resulting in topics that are confined to one sentiment, as illustrated in Table <ref type="table">2</ref>.</p></div>
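The regroup-then-model scheme above can be sketched as follows, with stand-in classifier and topic-extractor functions (the paper uses a trained Bayes classifier and AD-LDA; the lambdas here are placeholders for illustration only).

```python
def polarity_sensitive_topics(tweets, classify, extract_topics):
    """Group tweets by predicted sentiment, then run topic modeling on each
    group so that every extracted topic is confined to one polarity."""
    groups = {}
    for tweet in tweets:
        groups.setdefault(classify(tweet), []).append(tweet)
    return {label: extract_topics(docs) for label, docs in groups.items()}

# Stand-in classifier and topic extractor, for illustration only.
classify = lambda t: "positive" if "great" in t else "negative"
extract_topics = lambda docs: sorted({w for d in docs for w in d.split()})

topics = polarity_sensitive_topics(
    ["great day in kiev", "terror in syria", "great news on bitcoin"],
    classify, extract_topics)
```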
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Topic Agglomeration in a Latent Space</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Discovering Latent Factors Among the Discovered Topics Using Non-negative Matrix Factorization (NMF)</head><p>Because the initial topic modeling generated a high number of topics (1000 per window) that were, furthermore, very sparse in terms of their descriptive terms, these topics were hard to interpret and could benefit from a coarser, less fragmented organization. One way to fix this problem was to merge the topics based on conceptual similarity by applying Non-negative Matrix Factorization (NMF) <ref type="bibr" target="#b12">[LS99]</ref>.</p><p>Because the topics-by-words matrix is very sparse, we used NMF to project the topics onto a common lower-dimensional latent factor space. NMF takes as input the matrix X of n topics by m words (as binary features) and decomposes it into two factor matrices (A and B) which represent the topics and words, respectively, in a k f -dimensional latent space, as follows:</p><formula xml:id="formula_6">X n×m ≈ A n×k f B T m×k f (2)</formula><p>Positive Sentiment</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Negative Sentiment optimistic ukraine antiwar nonintervention horrible building badge hiding ukraine yanukovych syria refugees about education children million syria yarmouk camp crisis food waiting unrest shocking future technology bitcoins value law accelerating cnn protocols loss gox bitcoin fault</head><p>Table <ref type="table">2</ref>: Illustrating a sample of the finer topics extracted after a preliminary sentiment detection phase.</p><p>In Equation 2, k f is the approximate rank of matrices A and B, selected such that k f &lt; min(m, n), so that the number of elements in the decomposition matrices is far less than the number of elements of the original matrix:</p><formula xml:id="formula_7">nk f + k f m ≪ nm</formula><p>The topic factor matrix (A) can then be used to find the similarity between the topics in the new latent space instead of the original term space. The similarity matrix obtained from the NMF factors can finally be used to cluster the topics.</p><p>To find A and B, the Frobenius norm of the error between the data and the approximation is minimized, as follows:</p><formula xml:id="formula_8">J N M F = ||E|| 2 F = ||X − AB T || 2 F (3)</formula><p>Several algorithms have been proposed in the literature to minimize this cost. We used an Alternating Least Squares (ALS) method <ref type="bibr" target="#b16">[PT94]</ref> that iteratively solves for the factors, by assuming that the problem is convex in either one of the factor matrices alone. In the following, we summarize the steps that are applied after the discovery of the topics, in order to generate a hierarchical organization from the sparse topics.</p><p>1. Preprocessing of the topic vectors: For each window, the topic-word matrix (X n×m ) is extracted from the final topic modeling results. The features are the top words in a topic, and they are binary (1 if a topic has the word in question and 0 otherwise).</p><p>2. 
Latent Factor Discovery using NMF: The topic-word data was normalized before running NMF. The latter produces two factors (A and B), where n, k f and m are the number of topics, latent factors and words, respectively. Our main goal was to compute the matrix A, also called the topic basis factor, which maps the topics into the latent space. Choosing the number of factors, k f , has an impact on the results. After trial and error, we chose k f = 30.</p><p>3. Generating the topic-similarity matrix in the latent space: The computed topic basis matrix (A) was used to obtain the similarity in lieu of the original topic vectors. The normalized inner product of the matrix A and its transpose was calculated for this purpose. Normalization of the product is equivalent to computing the cosine similarity between topic pairs. The resulting matrix contains the pairwise similarity between all pairs of topics within the latent space.</p></div>
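Steps 2 and 3 can be sketched with a basic ALS NMF (in the spirit of Algorithm 4) followed by cosine similarity between the rows of A. The toy topic-word matrix, iteration count, and random seed below are illustrative; this is a sketch, not the exact implementation.

```python
import numpy as np

def nmf_als(X, k_f, n_iter=200, seed=0):
    """Basic ALS NMF: X (n x m) ~= A (n x k_f) @ B.T, projecting negative
    entries to zero after each least-squares solve (Algorithm 4 style)."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], k_f))
    for _ in range(n_iter):
        # Solve the least-squares problem for B given A, then project.
        Bt = np.linalg.lstsq(A, X, rcond=None)[0]       # k_f x m
        Bt = np.clip(Bt, 0, None)
        # Solve for A given B, then project onto nonnegatives.
        At = np.linalg.lstsq(Bt.T, X.T, rcond=None)[0]  # k_f x n
        A = np.clip(At.T, 0, None)
    return A, Bt.T

def topic_similarity(A):
    """Cosine similarity between topics in the latent space (rows of A)."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    U = A / np.maximum(norms, 1e-12)
    return U @ U.T

# Toy binary topic-word matrix: 4 topics over 5 words, two clear themes.
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], float)
A, B = nmf_als(X, k_f=2)
S = topic_similarity(A)   # topics 0,1 should be closer than topics 0,2
```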
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Hierarchical Clustering of the latent space-projected topics based on the new pairwise similarity scores computed in Step 3</head><p>We experimented with several linkage strategies, such as single and average linkage. The latter was chosen as optimal.</p></div>
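A naive average-linkage agglomeration over the latent-space similarities might look like the following; the similarity values are illustrative toy numbers (in practice one would run a library linkage routine on the Step 3 matrix).

```python
def average_linkage(dist, n_clusters):
    """Naive average-linkage agglomerative clustering on a precomputed
    distance matrix, where dist[i][j] = 1 - similarity (as in the paper)."""
    clusters = [[i] for i in range(len(dist))]
    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters.pop(q)
    return clusters

# Toy similarities: topics 0,1 are close; 2,3 are close; groups are far apart.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
dist = [[1 - s for s in row] for row in sim]
clusters = average_linkage(dist, 2)
```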
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Automated Hashtag Annotation</head><p>We have also experimented with a simple tag completion or prediction step prior to topic modeling. The annotation for a given tweet is determined by finding the top frequent tags associated with the K LS nearest neighboring tweets in the NMF-computed Latent Space (we report results for K LS = 5). Once the tags are completed, they are used to enrich the tweets before topic modeling. Of course, only the tweets' bag-of-words descriptions in a given window are used to compute the NMF for that window's topic modeling. The annotation generally resulted in a lower Perplexity of the extracted topic models, as shown in Figure <ref type="figure" target="#fig_8">6</ref>.</p></div>
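The neighbor-based tag completion can be sketched as follows; the latent vectors and hashtags are toy values, and `annotate` is our illustrative name for the step.

```python
from collections import Counter

def annotate(tweet_vec, latent_vecs, tags, k_ls=5, top_n=2):
    """Predict hashtags for a tweet as the most frequent tags among its
    k_ls nearest neighbors in the latent space (cosine similarity)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    # Rank neighboring tweets by similarity in the NMF latent space.
    order = sorted(range(len(latent_vecs)),
                   key=lambda i: cos(tweet_vec, latent_vecs[i]), reverse=True)
    counts = Counter(t for i in order[:k_ls] for t in tags[i])
    return [tag for tag, _ in counts.most_common(top_n)]

# Toy latent vectors and observed hashtags (illustrative values).
latent = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
tags = [["ukraine"], ["ukraine", "kiev"], ["bitcoin"], ["bitcoin", "gox"]]
predicted = annotate([1.0, 0.0], latent, tags, k_ls=2)
```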
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Distributed LDA-based Topic Modeling</head><p>Figure <ref type="figure" target="#fig_2">2</ref> shows a sample of the topic clusters' hierarchy extracted from the initial window, without NMF-based latent space projection of the topics. The clusters are of debatable quality. Perplexity is a common metric to evaluate language models [BL06] <ref type="bibr" target="#b5">[BNJ03]</ref>. It is monotonically decreasing in the likelihood of the test data, with a lower perplexity score indicating better generalization performance. For a test set of T documents, with N d being the total number of keywords in the d-th document, the perplexity, given in Equation 4, will be lower for a better topic model. Figure <ref type="figure" target="#fig_6">5</ref> shows the perplexity trends, suggesting that more topics result in lower (thus better) perplexity. Also, irrespective of the number of topics, AD-LDA-based topic modeling can extract topics of good quality.</p><formula xml:id="formula_9">perplexity (D test ) = exp ( − Σ T d=1 ln p (w d |α, β) / Σ T d=1 N d ) (4)</formula></div>
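Equation 4 reduces to a one-liner once per-document log-likelihoods are available; the numeric values below are arbitrary, and the helper name is ours. A sanity check: a model that assigns uniform probability 1/4 to every word gives a perplexity of exactly 4.

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity of a held-out set (Equation 4): the exponential of the
    negative total log-likelihood divided by the total number of keywords."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

# Toy example: per-document values of ln p(w_d | alpha, beta) and lengths.
ppl = perplexity([-10.0, -12.0], [8, 10])
# Uniform-model sanity check: 5 words each with probability 1/4.
uniform_ppl = perplexity([5 * math.log(0.25)], [5])
```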
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Sentiment Based Topic Modeling</head><p>Table <ref type="table">2</ref> shows a subset of topics extracted from the positive and negative sentiment groups of tweets; these tend to be more refined than the standard sentiment-agnostic topics. From the initial window, 1000 topics were extracted in the same way as with Distributed LDA; however, topic modeling was preceded by a sentiment classifier that labels the tweets as positive or negative. Although positive and negative topics still share a few keywords, they are clearly divided by sentiment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Topic Clustering in the Latent Space</head><p>Figure <ref type="figure" target="#fig_3">3</ref> shows the topic clusters created using the latent space-projected features extracted using NMF. The clusters in Figure <ref type="figure" target="#fig_3">3</ref> seem to have better quality compared to the clusters in Figure <ref type="figure" target="#fig_2">2</ref> because of the more accurate capture of pairwise similarities between topics in the conceptual space. Figure <ref type="figure" target="#fig_4">4</ref> shows the clustering of the top 10 topics for a series of 6 windows, showing how the agglomeration can consolidate the topics discovered at different time slots, helping avoid excessive fragmentation throughout the stream's life.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Automated Hashtag Annotation</head><p>Tweet data is very sparse, and not every tweet has valuable tags. To overcome this weakness, we applied an NMF-based automated tweet annotation before topic modeling. Adding the predicted hashtags to the tweets enhanced the topic modeling. The automated tag annotation, described in Section 5.3, generally resulted in a lower Perplexity of the extracted topic models, as shown in Figure <ref type="figure" target="#fig_8">6</ref>, suggesting that the auto-completed tags filled in some missing and valuable information in the sparse tweet data, thus aiding the topic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>Using Distributed LDA topic modeling, followed by NMF and hierarchical clustering within the resulting Latent Space (LS), helped organize the topics into less fragmented themes. Sentiment detection prior to topic modeling and automated hashtag annotation helped improve the learned topic models, while the agglomeration of topics across several time windows can link the topics discovered at different time windows. Our focus was on topic modeling and organization using the simplest (bag of words) features. Specialized Twitter feature extraction and selection methods, such as the ones surveyed and proposed by Aiello et al. [APM + 13], have the potential to improve our results, a direction we will explore in the future. Another direction to explore is the news-domain-specific, user-centered approach discussed in [SNT + 14], and a more expanded use of automated annotation to support topic extraction and description.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Topic Modeling Framework (sentiment detection and hashtag annotation are not shown).</figDesc><graphic coords="2,64.80,54.07,244.79,140.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>x di : i-th observed word in document d; z di : topic assigned to x di ; N wk : count of word w assigned to topic k; N dk : count of topic k assigned in document d; φ k : probability of word given topic k; θ d : probability of topic given document d; α, β: Dirichlet priors. Algorithm 1 Latent Dirichlet Allocation. Input: A document collection, hyper-parameters α and β. Output: A list of topics. 1. Draw a distribution over topics, θ d ∼ Dir(α). 2. For Each word i in the document: 3. Draw a topic index z di ∈ {1, • • • , K} from the topic weights, z di ∼ θ d . 4. Draw the observed word w di from the selected topic, w di ∼ β z di . p (D|α, β) = Π M d=1 ∫ p (θ d |α) Π N d n=1 Σ z dn p (z dn |θ d ) p (w dn |z dn , β) dθ d</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Dendrogram depicting a few clusters' hierarchy of topics from the initial window. Agglomeration is based on the cosine similarity. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as (1 − similarity). Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="4,73.80,54.07,216.00,57.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Portion of dendrogram depicting the clusters' hierarchy of topics from the initial window (Number 0). Agglomeration is based on the dot product between the topics' projections on a lower dimensional latent space extracted using NMF with k f = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as 1 minus similarity. Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="4,325.80,54.07,216.00,139.86" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Portion of dendrogram depicting the clusters' hierarchy of topics from the first 6 windows. Agglomeration is based on the dot product between the topics' projections on a lower dimensional latent space extracted using NMF with k f = 30 factors. Average-Linkage Agglomerative Hierarchical Clustering was used. Distance is computed as 1 minus similarity. Refer to the electronic version of the paper for clarity.</figDesc><graphic coords="5,73.80,125.63,216.00,147.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Algorithm 4</head><label>4</label><figDesc>Basic Alternating Least Squares (ALS) Algorithm for NMF. Input: Data matrix X, number of factors k f . Output: optimal matrices A and B. 1. Initialize matrix A (for example, randomly) 2. Repeat (a) Solve for B in the equation: A T AB = A T X (b) Project the solution onto the non-negative matrix subspace: set all negative values in B to zero (c) Solve for A in the equation: BB T A T = BX T (d) Project the solution onto the non-negative matrix subspace: set all negative values in A to zero 3. Until the decrease in the cost function falls below a threshold</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Perplexity trends for each sliding window of width three for various numbers of extracted topics.</figDesc><graphic coords="6,73.80,54.07,216.00,93.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Perplexity for different numbers of topics and varying window length, showing improved results when NMF-based automated tweet annotation is performed before topic modeling.</figDesc><graphic coords="7,73.80,54.07,215.99,82.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Description of used variables.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>global counts: N wkp ← N wk</figDesc><table><row><cell>4.</cell><cell>Sample z p locally:</cell></row><row><cell></cell><cell>LDAGibbsItr(x p ,z p ,N dkp ,N wkp ,α,β)</cell><cell>//</cell></row><row><cell></cell><cell>Alg: 2</cell></row><row><cell cols="2">5. Synchronize</cell></row><row><cell cols="2">6. Update global counts:</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">SocialSensor: http://www.socialsensor.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">JSON: JavaScript Object Notation, is a text-based open standard designed for human-readable data interchange</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://github.com/ravikiranj/twitter-sentiment-analyzer</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/linron84/JST</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Acknowledgements</head><p>We would like to thank the organizers of the SNOW 2014 workshop, in particular the members of the SocialSensor team, for their leadership in all the phases of the competition.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Clustering heterogeneous data sets</title>
		<author>
			<persName><forename type="first">Artur</forename><surname>Abdullin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Web Congress (LA-WEB)</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m">Eighth Latin American</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Sensing trending topics in twitter</title>
		<author>
			<persName><forename type="first">Luca</forename><forename type="middle">Maria</forename><surname>Aiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georgios</forename><surname>Petkos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlos</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Symeon</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Skraba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ayse</forename><surname>Goker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ioannis</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alejandro</forename><surname>Jaimes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Correlated topic models</title>
		<author>
			<persName><forename type="first">David</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page">147</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jon</forename><forename type="middle">D</forename><surname>McAuliffe</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1003.0783</idno>
		<title level="m">Supervised topic models</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">Juan</forename><forename type="middle">C</forename><surname>Caicedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jaafar</forename><surname>BenAbdallah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fabio</forename><forename type="middle">A</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="50" to="60" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Online learning for latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">Matthew</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francis</forename><surname>Bach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="856" to="864" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream</title>
		<author>
			<persName><forename type="first">Basheer</forename><surname>Hawwash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olfa</forename><surname>Nasraoui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications</title>
				<meeting>the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="109" to="117" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Calculating feature weights in naive Bayes with Kullback-Leibler measure</title>
		<author>
			<persName><forename type="first">Chang-Hwan</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Gutierrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dejing</forename><surname>Dou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Mining (ICDM), 2011 IEEE 11th International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1146" to="1151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Joint sentiment/topic model for sentiment analysis</title>
		<author>
			<persName><forename type="first">Chenghua</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yulan</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th ACM conference on Information and knowledge management</title>
				<meeting>the 18th ACM conference on Information and knowledge management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="375" to="384" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Arsa: a sentiment-aware model for predicting sales performance using blogs</title>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiangji</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aijun</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaohui</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 30th annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="607" to="614" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Learning the parts of objects by non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">Sebastian</forename><surname>Seung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">401</biblScope>
			<biblScope unit="issue">6755</biblScope>
			<biblScope unit="page" from="788" to="791" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Opinion integration through semi-supervised topic modeling</title>
		<author>
			<persName><forename type="first">Yue</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chengxiang</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th international conference on World wide web</title>
				<meeting>the 17th international conference on World wide web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="121" to="130" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">MALLET: A machine learning for language toolkit</title>
		<author>
			<persName><forename type="first">Andrew</forename><forename type="middle">Kachites</forename><surname>McCallum</surname></persName>
		</author>
		<ptr target="http://www.cs.umass.edu/mccallum/mallet" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Distributed algorithms for topic models</title>
		<author>
			<persName><forename type="first">David</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arthur</forename><surname>Asuncion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Padhraic</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Max</forename><surname>Welling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1801" to="1828" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media</title>
		<author>
			<persName><forename type="first">Symeon</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luca Maria</forename><surname>Aiello</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SNOW 2014 Data Challenge</title>
				<meeting>the SNOW 2014 Data Challenge</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</title>
		<author>
			<persName><forename type="first">Pentti</forename><surname>Paatero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Unto</forename><surname>Tapper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Environmetrics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="111" to="126" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Identifying and verifying news through social media: Developing a user-centered tool for professional journalists</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schifferes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Thurman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Goker</surname></persName>
		</author>
		<author>
			<persName><surname>Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Journalism</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Modeling online reviews with multi-grain topic models</title>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Mcdonald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th international conference on World Wide Web</title>
				<meeting>the 17th international conference on World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="111" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Annotating expressions of opinions and emotions in language</title>
		<author>
			<persName><forename type="first">Janyce</forename><surname>Wiebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Theresa</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire</forename><surname>Cardie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language resources and evaluation</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="165" to="210" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Efficient methods for topic model inference on streaming document collections</title>
		<author>
			<persName><forename type="first">Limin</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Mimno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="937" to="946" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Online topic models with infinite vocabulary</title>
		<author>
			<persName><forename type="first">Ke</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jordan</forename><surname>Boyd-Graber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
