<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ensemble Topic Modeling via Matrix Factorization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mark</forename><surname>Belford</surname></persName>
							<email>mark.belford@insight-centre.org</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brian</forename><forename type="middle">Mac</forename><surname>Namee</surname></persName>
							<email>brian.macnamee@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
							<email>derek.greene@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ensemble Topic Modeling via Matrix Factorization</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">46C6366D682D49464A2C471E5BD1B530</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents, facilitating knowledge discovery and information summarization. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, these methods tend to have stochastic elements in their initialization, which can lead to their output being unstable. That is, if a topic modeling algorithm is applied to the same data multiple times, the output will not necessarily always be the same. With this idea of stability in mind, we ask the question: how can we produce a definitive topic model that is both stable and accurate? To address this, we propose a new ensemble topic modeling method, based on Non-negative Matrix Factorization (NMF), which combines a collection of unstable topic models to produce a definitive output. We evaluate this method on an annotated tweet corpus, where we show that this new approach is more accurate and stable than traditional NMF.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Topic models aim to discover the latent semantic structure or topics within a corpus of documents, which can be derived from co-occurrences of words across the documents. Popular approaches for topic modeling have involved the application of probabilistic algorithms such as Latent Dirichlet Allocation (LDA) <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b14">15]</ref>, and also, more recently, matrix factorization algorithms <ref type="bibr" target="#b18">[19]</ref>. In both cases, these algorithms include stochastic elements in their initialization phase, prior to the optimization phase. This random component can affect the final composition of the topics and the rankings of the terms that describe those topics. This is problematic when seeking to capture a definitive topic modeling solution for a given corpus. Such issues represent a fundamental instability in these algorithms: different runs of the same algorithm on the same data can produce different outcomes <ref type="bibr" target="#b7">[8]</ref>. Most authors do not address this issue and instead simply utilize a single random initialization and present the results of the topic model as being definitive. Another challenge in topic modeling is the identification of coherent topics in noisy texts, such as tweets <ref type="bibr" target="#b0">[1]</ref>. The noisy and sparse nature of this data makes topic modeling more difficult when compared to analyzing longer, cleaner texts such as political speeches or news articles.</p><p>Here we consider the idea of ensemble learning, the rationale for which is that the combined judgment of a group of algorithms will often be superior to that of an individual <ref type="bibr" target="#b3">[4]</ref>. 
Such techniques have been well-established for both supervised classification tasks <ref type="bibr" target="#b12">[13]</ref> and also for unsupervised cluster analysis tasks <ref type="bibr" target="#b16">[17]</ref>. In the case of the latter, the goal is to produce a more accurate or useful clustering of the data, which also avoids the issue of instability which is inherent in algorithms such as k-means. The application of unsupervised ensembles generally involves two distinct stages: 1) the generation of a collection of different clusterings of the data; 2) the integration of these clusterings to yield a single more accurate, informative clustering of the data. A variety of different strategies for both generation and integration have been proposed in the literature <ref type="bibr" target="#b6">[7]</ref>.</p><p>In this paper we propose an ensemble method for topic modeling, based on the generation and integration of the results produced by multiple runs of Nonnegative Matrix Factorization (NMF) <ref type="bibr" target="#b10">[11]</ref> on the same corpus. The integration aspect of the algorithm builds on previous work involving the combination of topics from different time periods with NMF <ref type="bibr" target="#b9">[10]</ref>. To evaluate this method, we make use of a new Twitter corpus, the 20-topics dataset, which provides partial ground truth annotations for user accounts. The results on this data indicate that the combination of many diverse models into a single ensemble topic model produces a more definitive and stable solution, when compared with randomly initialized NMF.</p><p>The paper is structured as follows. In Section 2 we explore related work in the areas of topic modeling and ensemble clustering. In Section 3 we describe how the two step process of our ensemble method works, before evaluating this new method in comparison to randomly initialized NMF in Section 4. Finally in Section 5 we conclude the paper with ideas for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>In this section we examine related work on topic modeling and the popular algorithms frequently employed in the field. We also look briefly at ensemble clustering and the two main phases involved, as outlined in the literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Topic Modeling</head><p>Topic models attempt to discover the underlying thematic structure within a text corpus without relying on any form of training data. These models date back to the early work on latent semantic indexing by <ref type="bibr" target="#b4">[5]</ref>, which proposed the decomposition of term-document matrices for this purpose using Singular Value Decomposition <ref type="bibr" target="#b2">[3]</ref>. A topic model typically consists of k topics, each represented by a ranked list of strongly-associated terms (often referred to as a "topic descriptor"). Each document in the corpus can also be associated with one or more topics. Considerable research on topic modeling has focused on the use of probabilistic methods, where a topic is viewed as a probability distribution over words, with documents being mixtures of topics, thus permitting a topic model to be considered a generative model for documents <ref type="bibr" target="#b14">[15]</ref>. The most widely-applied probabilistic topic modeling approach is Latent Dirichlet Allocation (LDA) <ref type="bibr" target="#b1">[2]</ref>.</p><p>Alternative algorithms, such as Non-negative Matrix Factorization (NMF) <ref type="bibr" target="#b10">[11]</ref>, have also been effective in discovering the underlying topics in text corpora <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b18">19]</ref>. NMF is an unsupervised approach for reducing the dimensionality of non-negative matrices. When working with a document-term matrix A, the goal of NMF is to approximate this matrix as the product of two non-negative factors W and H, each with inner dimension k. These dimensions can be interpreted as k topics. As with LDA, the number of topics k to generate must be chosen beforehand. The values in H provide term weights which can be used to generate topic descriptions, while the values in W provide topic memberships for documents. 
One of the advantages of NMF methods over LDA-based methods is that there are fewer parameter choices involved in the modeling process. Typically NMF is initialized by populating W and H with random values before applying the optimization process. As noted previously, this can lead to different solutions for the two factors when applied to the same input matrix A.</p></div>
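<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustrative sketch of this factorization (a minimal NumPy implementation of the classic multiplicative update rules, not the alternating least squares solver used later in this paper), A can be decomposed as follows:</p><p>
```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0):
    """Factorize a non-negative matrix A (docs x terms) as A ~ W @ H,
    with W (docs x k) and H (k x terms), via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = A.shape
    W = rng.random((n_docs, k))
    H = rng.random((k, n_terms))
    eps = 1e-9  # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ A) / np.maximum(W.T @ W @ H, eps)
        W *= (A @ H.T) / np.maximum(W @ H @ H.T, eps)
    return W, H
```
</p><p>Under the orientation assumed in this sketch, the top-ranked entries in each row of H act as a topic descriptor, while each row of W gives the topic memberships of one document.</p></div>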
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Ensemble Clustering</head><p>In the machine learning literature, it has been shown that combining the strengths of a diverse set of clusterings can often yield more accurate and stable solutions <ref type="bibr" target="#b15">[16]</ref>. Such ensemble clustering approaches typically involve two phases: a generation phase where a collection of "base" clusterings are produced, and an integration phase where an aggregation function is applied to the ensemble members to produce a consensus solution. Generation often involves repeatedly applying a "base" algorithm with a stochastic element to different samples selected at random from a larger dataset. The most frequently employed integration strategy has been to use the information provided by an ensemble to determine the level of association between pairs of objects in a dataset <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b5">6]</ref>. The fundamental assumption underlying this strategy is that pairs belonging to the same natural class will frequently be co-assigned during repeated executions of the base clustering algorithm. Other strategies have involved matching together similar clusters from different runs of the base algorithm.</p><p>While most of this work has focused on producing disjoint clusterings (i.e. each item in the dataset can only belong to a single cluster), researchers have considered combining probabilistic clusterings <ref type="bibr" target="#b13">[14]</ref> and factorizations produced via NMF <ref type="bibr" target="#b8">[9]</ref>. In the latter case, the approach was applied to identify hierarchical structures in biological network data.</p></div>
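<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pairwise co-association integration strategy described above can be sketched in a few lines of NumPy; the function name coassociation is our own illustrative choice:</p><p>
```python
import numpy as np

def coassociation(labelings):
    """Co-association matrix for an ensemble of clusterings.
    Entry (i, j) is the fraction of runs in which items i and j
    were assigned to the same cluster."""
    labelings = np.asarray(labelings)        # shape: (n_runs, n_items)
    n_runs, n_items = labelings.shape
    C = np.zeros((n_items, n_items))
    for labels in labelings:
        C += labels[:, None] == labels[None, :]
    return C / n_runs
```
</p><p>Pairs that are frequently co-assigned across runs receive values near 1, and the resulting matrix can be clustered to obtain a consensus solution.</p></div>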
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head><p>In this section we give a brief overview of how our proposed two-step ensemble approach operates, before describing each of these steps in greater detail. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Overview</head><p>In this section we propose a new method for topic modeling, which involves applying ensemble learning in the form of two layers of NMF, in order to produce a stable and accurate final set of topics. This method builds on previous work on dynamic topic modeling involving the combination of topics from different time periods <ref type="bibr" target="#b9">[10]</ref>. Fig. <ref type="figure" target="#fig_0">1</ref> shows an overview of the method, which can naturally be divided into two steps, following previous ensemble approaches:</p><p>1. Ensemble generation: Create a set of base topic models by executing multiple runs of NMF applied to the same document-term matrix A. 2. Ensemble integration: Transform the base topic models to a suitable intermediate representation, and apply a further run of NMF to produce a single ensemble topic model, which represents the final output of the method.</p><p>We now discuss each of these steps in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Ensemble Generation</head><p>Unsupervised ensemble procedures typically seek to encourage diversity with a view to improving the quality of the information available in the integration phase <ref type="bibr" target="#b17">[18]</ref>. Therefore, in the first step of our approach, we create a diverse set of r base topic models; that is, the topic-term descriptors and document assignments will differ from one base model to another. Here we encourage diversity by relying on the inherent instability of NMF with random initialization: we generate each base model by populating the factors W and H with values based on a different random seed, and then applying NMF to A. In each case we use a fixed pre-specified value for the number of topics k. After each run, the H factor from the base topic model (i.e. the topic-term weight matrix) is stored for later use. Note that in our experiments we use the fast alternating least squares implementation of NMF introduced by Lin <ref type="bibr" target="#b11">[12]</ref>.</p></div>
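<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this generation step, with a simple multiplicative-update NMF standing in for the alternating least squares solver of Lin [12], might look as follows:</p><p>
```python
import numpy as np

def run_nmf(A, k, seed, n_iter=100):
    # One randomly-initialized NMF run. Multiplicative updates are used
    # here as a stand-in for the ALS solver used in the paper.
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    eps = 1e-9
    for _ in range(n_iter):
        H *= (W.T @ A) / np.maximum(W.T @ W @ H, eps)
        W *= (A @ H.T) / np.maximum(W @ H @ H.T, eps)
    return W, H

def generate_ensemble(A, k, r):
    # Step 1: r base topic models from different random seeds,
    # keeping only the topic-term factors H for the integration step.
    return [run_nmf(A, k, seed=s)[1] for s in range(r)]
```
</p><p>Each seed yields a different local solution, which is precisely the diversity the ensemble relies on.</p></div>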
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Ensemble Integration</head><p>Once we have generated a collection of r factorizations, in the second step we create a new representation of our corpus in the form of a topic-term matrix M.</p><p>The matrix is created by stacking the transpose of each H factor generated in the first step. It is important to note that this process of combining the factors is order independent. This results in a matrix where each row corresponds to a topic from one of the base topic models, and each column is a term from the original corpus. Each entry M ij holds the weight of association for term j in relation to a single topic from a base model. To standardize the range of the values, we apply L2 normalization to the columns of M.</p><p>Once we have created M, we apply the second layer of NMF to this matrix to produce the final ensemble topic model. The reasoning behind applying NMF a second time to these topic descriptors is that they explicitly capture the variance between the base topic models. To improve the quality of the resulting topics, we generate initial factors using the popular Non-negative Double Singular Value Decomposition (NNDSVD) initialization approach of <ref type="bibr" target="#b2">[3]</ref>. As an input parameter to NMF, we specify the final number of topics k. While this value can be set to be the same as the number of topics k in the base models, in practice we observe that an appropriate value of k may be larger than this, due to the ensemble approach being able to capture topics that only appear intermittently among a diverse set of base topic models. The resulting H factor provides weights for the terms for each of the k ensemble topics: the top-ranked terms in each column can be used as descriptors for a topic. 
To produce weights for the original documents in our corpus, we can "fold" the documents into the ensemble model by applying a projection to the document-term matrix A:</p><formula xml:id="formula_0">D = A • H^T</formula><p>Each row of D now corresponds to a document, with columns corresponding to the k ensemble topics. An entry D ij indicates the strength of association of document i with ensemble topic j.</p></div>
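<div xmlns="http://www.tei-c.org/ns/1.0"><p>A compact sketch of this integration step is given below. It assumes each base H has topics as rows, and substitutes a simplified SVD-based non-negative initialization for full NNDSVD; the function name integrate is an illustrative choice:</p><p>
```python
import numpy as np

def integrate(A, base_H_list, k, n_iter=100):
    """Step 2: stack the base topic-term factors into M, L2-normalize
    its columns, factorize M, and fold the documents back in.
    Assumes each base H has shape (k_base, n_terms), topics as rows."""
    M = np.vstack(base_H_list)                        # (r*k_base, n_terms)
    M = M / np.maximum(np.linalg.norm(M, axis=0), 1e-9)
    # SVD-based non-negative init (a simplification of NNDSVD).
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W = np.abs(U[:, :k]) * np.sqrt(S[:k])
    H = (np.abs(Vt[:k, :]).T * np.sqrt(S[:k])).T
    eps = 1e-9
    for _ in range(n_iter):                           # multiplicative updates
        H *= (W.T @ M) / np.maximum(W.T @ W @ H, eps)
        W *= (M @ H.T) / np.maximum(W @ H @ H.T, eps)
    D = A @ H.T        # fold documents in: document-topic weights
    return H, D
```
</p><p>The returned H gives the term weights for the k ensemble topics, and D gives the folded document-topic associations.</p></div>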
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluation</head><p>In this section we will give a brief summary of the dataset collected for this paper, the experimental setup, and finally an evaluation of our ensemble approach in comparison to randomly initialized NMF with respect to accuracy and stability of the topic models produced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data</head><p>One current area of interest for topic modeling is in the analysis of Twitter data <ref type="bibr" target="#b0">[1]</ref>. However, annotated ground truth text corpora are rarely available for this platform, due to the scale of data involved. To evaluate our proposed method in the context of social media data, we collected a new corpus, the 20-topics dataset, which consists of tweets from 1,200 user accounts corresponding to 20 distinct ground-truth categories, as can be seen in Table <ref type="table" target="#tab_0">1</ref>. These categories were manually identified by leveraging community-maintained lists of high-profile users who predominantly tweet about a single topic, such as fashion or music. Therefore, each user is assigned to a single category. Using the Twitter REST API we collected 4,170,382 tweets for these 1,200 "core" users over the period March 2015 to February 2016. In addition, to make the topic modeling task more challenging, we identified a second set of 4,000 users who were randomly selected from among the friends of the core users. These users are not annotated with a ground truth category label, and their content does not necessarily pertain to any of the categories. We collected 16,429,510 tweets for these "friend" users. We randomly divide this second set into blocks of 1,000 users, which allow us to vary the level of noise in our dataset when evaluating topic model solutions.</p><p>The full set of tweets was processed as follows. Firstly, all links and user mentions were stripped from the tweet text. Hashtags were kept, but the # prefix was removed. At this point, the tweets for each user for a given week were concatenated into a single "weekly user document". The justification for this is that individual tweets are short and often contain little textual content that is useful from the perspective of topic modeling. 
However, by combining multiple tweets from the same user into a single, longer document, we can perform topic modeling more effectively.</p><p>After creating these weekly user documents, we apply standard text preprocessing steps:</p><p>1. Find all individual tokens in each document, through conversion to lowercase and string tokenization. These tokens include both ordinary words and hashtags. 2. Remove single character tokens, emoticons, and tokens corresponding to generic stop words (e.g. "are", "the") and Twitter-specific stop words (e.g. "rt", "mt"). 3. Remove documents containing fewer than 3 tokens. 4. Construct a document-term matrix based on the remaining tokens and documents. Apply TF-IDF term weighting and document length normalization.</p><p>The resulting dataset consisted of a total of 40,722 weekly documents for core users and an additional 155,758 documents for friend users.</p></div>
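<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal standard-library sketch of steps 1 to 4 is shown below; the stop-word list and token pattern here are illustrative placeholders, not the actual lists used in our experiments:</p><p>
```python
import math
import re
from collections import Counter

STOP = {"are", "the", "rt", "mt"}  # generic plus Twitter-specific stop words

def preprocess(raw_docs, min_tokens=3):
    # Steps 1-3: lowercase, tokenize, drop single-character tokens and
    # stop words, and drop documents with too few remaining tokens.
    docs = []
    for text in raw_docs:
        tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
                  if len(t) > 1 and t not in STOP]
        if len(tokens) >= min_tokens:
            docs.append(tokens)
    return docs

def tfidf_matrix(docs):
    # Step 4: document-term matrix with TF-IDF weighting and
    # document length (L2) normalization.
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for t, c in Counter(d).items():
            row[index[t]] = c * math.log(n / df[t])
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return rows, vocab
```
</p><p>The rows of the resulting matrix are the length-normalized TF-IDF vectors that form the input matrix A for NMF.</p></div>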
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experimental Setup</head><p>To evaluate the proposed method, we generated r = 100 base topic models using NMF with random initialization and combined them as described in Section 3.3. In each case we set the number of base topics (k = 20) and the number of ensemble topics (k = 20) to correspond to the number of ground truth categories. We ran this process on the initial set of 1,200 core users, and then repeated the process after including (1000, 2000, 3000, 4000) additional friend users, up to the case where all ≈ 195k weekly documents were included. These friend users were added to evaluate the accuracy and stability of randomly initialized NMF and our ensemble approach with respect to varying levels of noise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Evaluation of Stability</head><p>The goal of our first experiment was to quantify the extent to which instability is a problem with randomly-initialized NMF, and whether an ensemble approach can mitigate this instability. Firstly, we examined the stability between 100 base runs of randomly-initialized NMF to evaluate whether topics become less stable with varying levels of noise. To do this, we assign each weekly user document to a single topic for which it has the highest weight according to the factor H, and then measure the agreement between the document assignments for different runs. As a measure of agreement, we use Normalized Mutual Information (NMI), which has previously been used in the evaluation of ensemble clusterings <ref type="bibr" target="#b15">[16]</ref>. A pair of topic models that are identical will achieve an NMI score of 1.0 (i.e. high stability), while a pair with little agreement will achieve a lower score (i.e. low stability). We compute an overall stability score by calculating the NMI between all pairs of models for a given number of friend users and calculating the mean of these values.</p><p>We calculated the NMI score for each unique pair of topic model outputs. To evaluate the stability of randomly-initialized NMF with respect to varying levels of noise, this was repeated while adding weekly summary documents from the friend user set. To vary the level of noise added, these were added in blocks of 1,000 at a time, up to 4,000 friend users. Fig. <ref type="figure" target="#fig_1">2</ref> shows the stability scores for randomly-initialized NMF for each case. 
It is clear that as the level of background noise increases, we see a greater variation in the outputs produced by NMF, as it becomes more challenging to identify a definitive solution.</p><p>To provide some context as to what this instability means in practice, Table <ref type="table" target="#tab_1">2</ref> shows an example of descriptors for a topic relating to UK politics, as they appear in five different runs of NMF. While each case does appear to be related to politics, we see variation in the composition and ordering of the top-ranked terms, with terms such as "Cameron" and "tax" appearing intermittently.</p><p>To determine whether our proposed approach can address this problem, we generated 10 ensemble topic models, each comprising 100 different base topic models initialized with different random seeds. Again we compute the mean pairwise agreement between the document assignments for all runs. We see from Fig. <ref type="figure" target="#fig_1">2</ref> that the ensemble method produces a much more stable solution, even when increasing the level of noise in the data. The stability scores for the ensemble approach show quite a small variation, ranging from 0.9929 to 0.9353, while the scores for randomly initialized NMF vary much more, ranging from 0.8394 to 0.6368. Our ensemble approach manages to produce a definitive topic modeling solution which, crucially, can be replicated across different runs.</p></div>
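<div xmlns="http://www.tei-c.org/ns/1.0"><p>The stability score described above can be sketched as follows; this sketch uses the geometric-mean normalization of NMI common in the ensemble clustering literature [16], which may differ from other NMI variants:</p><p>
```python
import numpy as np
from itertools import combinations

def nmi(a, b):
    # Normalized mutual information between two labelings
    # (integer labels 0..k-1), geometric-mean normalization.
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    ka, kb = a.max() + 1, b.max() + 1
    joint = np.zeros((ka, kb))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= n
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in range(ka):
        for j in range(kb):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (pa[i] * pb[j]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    denom = np.sqrt(ha * hb)
    return mi / denom if denom > 0 else 1.0

def stability(assignments):
    # Mean pairwise NMI over the document assignments of all runs,
    # where each assignment is the argmax topic for each document.
    scores = [nmi(a, b) for a, b in combinations(assignments, 2)]
    return float(np.mean(scores))
```
</p><p>Identical assignments score 1.0, so a stable method yields a mean close to 1 across all run pairs.</p></div>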
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Evaluation of Accuracy</head><p>While stability is an important requirement, we also need to ensure that we can produce a topic model which accurately summarizes the contents of the corpus. Specifically, we now focus on whether combining a base set of unstable topic models using our ensemble method produces an accurate result relative to the ground truth annotations in the 20-topics corpus. Firstly, we can manually inspect the topic descriptors generated by applying ensemble topic modeling. Table <ref type="table" target="#tab_2">3</ref> shows the descriptors for the case where ensemble topic modeling is applied to the set of 1,200 users, along with a manually selected label corresponding to the most similar ground truth category. We see that 18 out of 20 ground truth categories are clearly identified, with two categories ('Irish politics' and 'football') replaced by two extra topics relating to 'energy' and 'technology'. In general we observed that, across all experiments on this corpus, the 'Irish politics' topic consistently overlapped with the 'UK politics' topic, while the 'football' topic frequently overlapped with the 'NFL' topic. This is perhaps unsurprising given the partially shared vocabulary in both cases. To quantitatively evaluate accuracy, we can use NMI to measure the degree to which document assignments from a topic model agree with the ground truth categories listed in Table <ref type="table" target="#tab_0">1</ref>. Again we consider the case where increasing numbers of noisy documents from friend users are added to the data. Note that, while we add friend users, we only consider the document-topic assignments for our set of 'core' users when calculating the NMI score.</p><p>Based on 100 runs of randomly-initialized NMF, Fig. <ref type="figure" target="#fig_2">3</ref> shows the mean, minimum, and maximum NMI scores. We can make two observations based on these results. 
Firstly, the mean accuracy of the topic models decreases considerably as more friend users are added. Secondly, there is considerable variation in accuracy across the 100 runs, due to random initialization. In contrast, Fig. <ref type="figure" target="#fig_2">3</ref> shows that ensemble topic modeling achieves a level of accuracy above the maximum accuracy of the ensemble members from which it was composed; in this case the ensemble topic model is "greater than the sum of its parts". Taking this result in conjunction with the results from Section 4.3, this suggests that the combination of many unstable and diverse base topic models can produce a more accurate topic model. From Fig. <ref type="figure" target="#fig_2">3</ref>, we also observe that the decline in NMI as more friend users are added is less pronounced, suggesting that the ensemble method is more robust to noise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>In this paper we have proposed a new ensemble topic modeling method, based on the combination of multiple matrix factorizations to produce a single ensemble model. We compared its performance to standard NMF on a tweet corpus, in terms of both stability and accuracy. We have observed that the proposed method not only yields a more accurate topic model with respect to document-topic assignments, but also produces a far more stable output, with little variation across multiple runs.</p><p>There are a number of future avenues of research which we would like to explore. Firstly, we intend to evaluate the proposed method on a range of other datasets, which consist of not only tweets but other sources of text such as news articles. We would also like to investigate alternative ensemble generation strategies, such as random subsampling of documents and terms, to evaluate whether promoting further diversity improves the quality of the ensemble results. We also intend to investigate the number of base topic models required in the ensemble generation phase to generate an accurate and stable solution. Finally, we would be interested in generalizing our ensemble approach to work with other topic modeling algorithms, such as LDA, where instability is also an issue.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Illustration of the two steps involved in the ensemble topic modeling algorithm: generation and integration.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Comparison of stability for randomly-initialized NMF and ensemble topic modeling, based on mean pairwise NMI agreement, for increasing numbers of friend users.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Comparison of NMI accuracy for randomly-initialized NMF and ensemble topic modeling, for increasing numbers of friend users.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Number of tweets, unique user accounts, and user documents for each topic in the 20-topics dataset.</figDesc><table><row><cell>Category</cell><cell>Tweets</cell><cell>Users</cell><cell>User Documents</cell></row><row><cell>Aviation</cell><cell>186,641</cell><cell>57</cell><cell>2,440</cell></row><row><cell>Basketball</cell><cell>245,359</cell><cell>61</cell><cell>1,467</cell></row><row><cell>Business</cell><cell>223,148</cell><cell>70</cell><cell>1,876</cell></row><row><cell>Energy</cell><cell>125,130</cell><cell>40</cell><cell>1,621</cell></row><row><cell>Fashion</cell><cell>159,819</cell><cell>40</cell><cell>1,227</cell></row><row><cell>Food</cell><cell>159,615</cell><cell>45</cell><cell>1,775</cell></row><row><cell>Football</cell><cell>359,393</cell><cell>89</cell><cell>1,524</cell></row><row><cell>Formula One</cell><cell>143,197</cell><cell>42</cell><cell>1,757</cell></row><row><cell>Health</cell><cell>209,941</cell><cell>60</cell><cell>2,542</cell></row><row><cell>Irish Politics</cell><cell>170,000</cell><cell>50</cell><cell>2,318</cell></row><row><cell>Movies</cell><cell>139,337</cell><cell>38</cell><cell>1,395</cell></row><row><cell>Music</cell><cell>208,838</cell><cell>56</cell><cell>1,539</cell></row><row><cell>NFL</cell><cell>255,554</cell><cell>80</cell><cell>1,388</cell></row><row><cell>Rugby</cell><cell>265,123</cell><cell>76</cell><cell>2,264</cell></row><row><cell>Space</cell><cell>127,280</cell><cell>51</cell><cell>2,157</cell></row><row><cell>Tech</cell><cell>250,486</cell><cell>66</cell><cell>1,947</cell></row><row><cell>Tennis</cell><cell>139,067</cell><cell>41</cell><cell>1,427</cell></row><row><cell>UK Politics</cell><cell>245,651</cell><cell>77</cell><cell>3,182</cell></row><row><cell>US 
Politics</cell><cell>332,766</cell><cell>103</cell><cell>4,503</cell></row><row><cell>Weather</cell><cell>224,037</cell><cell>65</cell><cell>2,373</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Example of instability between 5 different runs of randomly-initialised NMF, for topics relating to UK politics.</figDesc><table><row><cell cols="2">Run Top 10 Terms</cell></row><row><cell>1</cell><cell>labour, tories, tory, nhs, people, cameron, uk, party, mp, support</cell></row><row><cell>2</cell><cell>labour, people, ge16, tories, vote, support, tory, party, government, nhs</cell></row><row><cell>3</cell><cell>labour, tories, uk, tory, nhs, people, cameron, tax, mp, party</cell></row><row><cell>4</cell><cell>labour, people, ge16, tories, vote, tory, support, party, government, nhs</cell></row><row><cell>5</cell><cell>labour, people, ge16, tories, uk, government, vote, support, govt, tory</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Topic descriptors for 20 topics generated by applying ensemble topic modeling to the 20-topics corpus, using tweets from 1,200 core users. The most similar ground truth category for each topic is also listed.</figDesc><table><row><cell>Category</cell><cell>Top 10 Terms</cell></row><row><cell>Energy 1</cell><cell>fracking, shale, gas, energy, natgas, natural, naturalgas, pa, epa, emissions</cell></row><row><cell>US Politics</cell><cell>gopdebate, president, gop, obama, senate, clinton, bill, hillary, trump, congress</cell></row><row><cell>Rugby</cell><cell>rugby, rwc2015, england, cup, wales, ireland, world, try, match, rbs6nations</cell></row><row><cell>NFL</cell><cell>game, nfl, season, win, patriots, team, league, football, tonight, goal</cell></row><row><cell>Tech 1</cell><cell>apple, watch, applewatch, google, app, music, tv, ios, facebook, macbook</cell></row><row><cell>UK Politics</cell><cell>labour, ge16, people, tories, vote, tory, party, government, nhs, support</cell></row><row><cell>Basketball</cell><cell>bulls, rose, butler, hoiberg, gasol, nba, game, noah, jimmy, pau</cell></row><row><cell>Weather</cell><cell>rain, snow, weather, forecast, showers, storm, tornado, severe, dry, winds</cell></row><row><cell>Business</cell><cell>china, stocks, market, fed, markets, stock, growth, tech, uk, ftse</cell></row><row><cell>Health</cell><cell>health, cancer, study, risk, patients, care, diabetes, zika, drug, disease</cell></row><row><cell>Music</cell><cell>album, music, video, listen, song, track, remix, tour, premiere, check</cell></row><row><cell>Aviation</cell><cell>avgeek, aviation, boeing, flight, airlines, air, aircraft, airbus, airport, paxex</cell></row><row><cell>Tech 2</cell><cell>iphone, ios, ipad, mac, apple, app, os, apps, beta, plus</cell></row><row><cell>Fashion</cell><cell>fashion, daily, nyfw, stories, style, collection, dress, wear, beauty, show</cell></row><row><cell>Food</cell><cell>recipes, recipe, food, chicken, best, dinner, delicious, chocolate, chef, restaurant</cell></row><row><cell>Formula One</cell><cell>f1, race, ferrari, hamilton, mclaren, mercedes, renault, rosberg, gp, bull</cell></row><row><cell>Movies</cell><cell>film, review, movie, trailer, star, wars, movies, films, awakens, oscars</cell></row><row><cell>Tennis</cell><cell>tennis, atp, murray, djokovic, federer, serena, nadal, wimbledon, ausopen, wta</cell></row><row><cell>Space</cell><cell>space, yearinspace, pluto, earth, nasa, mars, mission, launch, journeytomars, science</cell></row><row><cell>Energy 2</cell><cell>oil, energy, gas, crude, prices, opec, offshore, production, exports, oilandgas</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. This research was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sensing trending topics in Twitter</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Aiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Petkos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Skraba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Göker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jaimes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1268" to="1282" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SVD based initialization: A head start for nonnegative matrix factorization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Boutsidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gallopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1350" to="1362" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Bagging predictors</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="123" to="140" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society of Information Science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Finding consistent clusters in data partitions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fred</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 2nd International Workshop on Multiple Classifier Systems (MCS&apos;01)</title>
				<meeting>2nd International Workshop on Multiple Classifier Systems (MCS&apos;01)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2001-01">January 2001</date>
			<biblScope unit="volume">2096</biblScope>
			<biblScope unit="page" from="309" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A Survey: Clustering Ensembles Techniques</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ghaemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sulaiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mustapha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of World Academy of Science, Engineering and Technology</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="2070" to="3740" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">How Many Topics? Stability Analysis for Topic Models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>O'Callaghan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cunningham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. European Conference on Machine Learning (ECML&apos;14)</title>
				<meeting>European Conference on Machine Learning (ECML&apos;14)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="498" to="513" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions</title>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cagney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Krogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cunningham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">15</biblScope>
			<biblScope unit="page" from="1722" to="1728" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Exploring the political agenda of the European Parliament using a dynamic topic modelling approach</title>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Cross</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">5th Annual General Conference of the European Political Science Association (EPSA&apos;15)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Learning the parts of objects by non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">S</forename><surname>Seung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">401</biblScope>
			<biblScope unit="page" from="788" to="791" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Projected gradient methods for non-negative matrix factorization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="2756" to="2779" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Generating accurate and diverse members of a neural-network ensemble</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Opitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Shavlik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="535" to="541" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Soft Cluster Ensembles</title>
		<author>
			<persName><forename type="first">K</forename><surname>Punera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Fuzzy Clustering and Its Applications</title>
				<imprint>
			<publisher>Wiley</publisher>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Probabilistic topic models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Steyvers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Griffiths</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Latent Semantic Analysis: A Road to Meaning</title>
		<imprint>
			<publisher>Lawrence Erlbaum</publisher>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Cluster ensembles -a knowledge reuse framework for combining multiple partitions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Strehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="583" to="617" />
			<date type="published" when="2002-12">December 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Cluster ensembles -a knowledge reuse framework for combining partitionings</title>
		<author>
			<persName><forename type="first">A</forename><surname>Strehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Conference on Artificial Intelligence (AAAI&apos;02)</title>
				<meeting>Conference on Artificial Intelligence (AAAI&apos;02)</meeting>
		<imprint>
			<publisher>AAAI/MIT Press</publisher>
			<date type="published" when="2002-07">July 2002</date>
			<biblScope unit="page" from="93" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Clustering ensembles: Models of consensus and weak partitions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Topchy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Punch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1866" to="1881" />
			<date type="published" when="2005-12">December 2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Group matrix factorization for scalable topic modeling</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 35th SIGIR Conf. on Research and Development in Information Retrieval</title>
				<meeting>35th SIGIR Conf. on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="375" to="384" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
