<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mining Newsworthy Topics from Social Media</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Carlos</forename><surname>Martin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">IDEAS Research Institute</orgName>
								<orgName type="department" key="dep2">School of Computing &amp; Digital Media</orgName>
								<orgName type="institution">Robert Gordon University</orgName>
								<address>
									<postCode>AB10 7QB</postCode>
									<settlement>Aberdeen</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Corney</surname></persName>
							<email>d.p.a.corney@rgu.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">IDEAS Research Institute</orgName>
								<orgName type="department" key="dep2">School of Computing &amp; Digital Media</orgName>
								<orgName type="institution">Robert Gordon University</orgName>
								<address>
									<postCode>AB10 7QB</postCode>
									<settlement>Aberdeen</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ayse</forename><surname>Göker</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">IDEAS Research Institute</orgName>
								<orgName type="department" key="dep2">School of Computing &amp; Digital Media</orgName>
								<orgName type="institution">Robert Gordon University</orgName>
								<address>
									<postCode>AB10 7QB</postCode>
									<settlement>Aberdeen</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrew</forename><surname>Macfarlane</surname></persName>
							<email>a.macfarlane-1@city.ac.uk</email>
							<affiliation key="aff1">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">City University London</orgName>
								<address>
									<postCode>EC1V 0HB</postCode>
									<settlement>London</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mining Newsworthy Topics from Social Media</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D1C9A22EEE6330FE1245034406358D40</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>topic detection</term>
					<term>Twitter</term>
					<term>temporal analysis</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit, and journalists now use them to rapidly discover stories and eye-witness accounts. We present a technique, designed for a real-time topic-detection system, that detects "bursts" of phrases on Twitter. We describe a time-dependent variant of the classic tf-idf approach and group together bursty phrases that often appear in the same messages in order to identify emerging topics. We demonstrate our methods by analysing tweets corresponding to events drawn from the worlds of politics and sport. We created a user-centred "ground truth" to evaluate our methods, based on mainstream media accounts of the events. This helps ensure our methods remain practical. We compare several clustering and topic ranking methods to discover the characteristics of news-related collections, and show that different strategies are needed to detect emerging topics within them. We show that our methods successfully detect a range of different topics for each event and can retrieve messages (for example, tweets) that represent each topic for the user.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The growth of social networking sites, such as Twitter, Facebook and Reddit, is well documented. Every day, a huge variety of information on different topics is shared by many people. Given the real-time, global nature of these sites, they are used by many people as a primary source of news content <ref type="bibr" target="#b0">[1]</ref>. Increasingly, such sites are also used by journalists, partly to find and track breaking news but also to find user-generated content such as photos and videos, to enhance their stories. These often come from eye-witnesses who would be otherwise difficult to find, especially given the volume of content being shared.</p><p>Our overall goal is to produce a practical tool to help journalists and news readers to find newsworthy topics from message streams without being overwhelmed. Note that it is not our intention to re-create Twitter's own "trending topics" functionality. That is usually dominated by very high-level topics and memes, defined by just one or two words or a name and with no emphasis on 'news'.</p><p>Our system works by identifying phrases that show a sudden increase in frequency (a "burst") and then finding co-occurring groups to identify topics. Such bursts are typically responses to real-world events. In this way, the news consumer can avoid being overwhelmed by redundant messages, even if the initial stream is formed of diverse messages. The emphasis is on the temporal nature of message streams as we bring to the surface groups of messages that contain suddenly-popular phrases. An early version of this approach was recently described <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, where it compared favourably to several alternatives and benchmarks. 
Here we expand and update that work, examining the effect of different clustering and topic ranking approaches used to form coherent topics from bursty phrases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Newman <ref type="bibr" target="#b3">[4]</ref> discusses the central use of social media by news professionals, such as hosting live blogs of ongoing events. He also describes the growth of collaborative, networked journalism, where news professionals draw together a wide range of images, videos and text from social networks and provide a curation service. Broadcasters and newspapers can also use social media to increase brand loyalty across a fragmented media marketplace.</p><p>Petrovic et al. <ref type="bibr" target="#b4">[5]</ref> focus on the task of first-story detection (FSD), which they also call "new event detection". They use a locality sensitive hashing technique on 160 million Twitter posts, hashing incoming tweet vectors into buckets in order to find the nearest neighbour and hence detect new events and track them. This work is extended in Petrovic et al. <ref type="bibr" target="#b5">[6]</ref> using paraphrases for first story detection on 50 million tweets. Their FSD evaluation used newswire sources rather than Tweets, based on the existing TDT5 datasets. The Twitter-based evaluation was limited to calculating the average precision of their system, by getting two human annotators to label the output as being about an event or not. This contrasts with our goal here, which is to measure the topic-level recall, i.e. to count how many newsworthy stories the system retrieved.</p><p>Benhardus <ref type="bibr" target="#b6">[7]</ref> uses standard collection statistics such as tf-idf, unigrams and bigrams to detect trending topics. Two data collections are used, one from the Twitter API and the second being the Edinburgh Twitter corpus containing 97 million tweets, which was used as a baseline with some natural language processing used (e.g. detecting prepositions or conjunctions). 
The research focused on general trending topics (typically involving personalities and new hashtags) rather than focusing on the needs of journalistic users and news readers.</p><p>Shamma et al. <ref type="bibr" target="#b7">[8]</ref> focus on "peaky topics" (topics that show highly localized, momentary interest) by using unigrams only. The focus of the method is to obtain peak terms for a given time slot when compared to the whole corpus rather than over a given time-frame. The use of the whole corpus favours batch-mode processing and is less suitable for real-time and user-centred analysis. Phuvipadawat and Murata <ref type="bibr" target="#b8">[9]</ref> analysed 154,000 tweets that contained the hashtag '#breakingnews'. They determine the popularity of messages by counting retweets and detecting popular terms such as nouns and verbs. This work is taken further with a simple tf-idf scheme that is used to identify similarity <ref type="bibr" target="#b9">[10]</ref>; named entities are then identified using the Stanford Named Entity Recogniser in order to identify communities and similar message groups. Sayyadi et al. <ref type="bibr" target="#b10">[11]</ref> also model the community to discover and detect events on the live Labs SocialStream platform, extracting keywords, noun phrases and named entities. Ozdikis et al. <ref type="bibr" target="#b11">[12]</ref> also detect events using hashtags by clustering them and finding semantic similarities between hashtags, the latter being more of a lexicographic method. Ratkiewicz et al. <ref type="bibr" target="#b12">[13]</ref> focus specifically on the detection of a single type of topic, namely political abuse. The evidence used includes hashtags and mentions. Alvanaki et al. <ref type="bibr" target="#b13">[14]</ref> propose a system based on popular seed tags (tag pairs) which are then tracked, with any shifts detected and monitored.
These articles do use natural language processing methods and most consider temporal factors, but do not use n-grams.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Becker et al. <ref type="bibr" target="#b14">[15]</ref> also consider temporal issues by focusing on the online detection of real-world events, distinguishing them from non-events (e.g. conversations between posters). Clustering and classification algorithms are used to achieve this. Methods such as n-grams and NLP are not considered.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">BNgrams</head><p>Term frequency-inverse document frequency, or tf-idf, has been used for indexing documents since it was first introduced <ref type="bibr" target="#b15">[16]</ref>. However, we are not interested in indexing documents but in finding novel trends, so we want to find terms that appear in one time period more than in others. We treat temporal windows as documents and use them to detect words and phrases that are both new and significant. We therefore define newsworthiness as the combination of novelty and significance. We can maximise significance by filtering tweets either by keywords (as in this work) or by following a carefully chosen list of users, and maximise novelty by finding bursts of suddenly high-frequency words and phrases.</p><p>We select terms with a high "temporal document frequency-inverse document frequency", or df-idf_t, by comparing the most recent x messages with the previous x messages and counting how many contain the term. We regard the most recent x messages as one "slot". After standard tokenization and stop-word removal, we index all the terms from these messages. For each term, we calculate the document frequency for a set of messages using df_ti, defined as the number of messages in a set i that contain the term t.</p><formula xml:id="formula_0">df-idf_ti = (df_ti + 1) · 1 / (log(df_t(i-1) + 1) + 1).<label>(1)</label></formula><p>This produces a list of terms which can be ranked by their df-idf_t scores. Note that we add one to term counts to avoid problems with dividing by zero or taking the log of zero. To maintain some word-order information, we define terms as n-grams, i.e. sequences of n words. Based on experiments reported elsewhere <ref type="bibr" target="#b2">[3]</ref>, we use 1-, 2- and 3-grams in this work. High-frequency n-grams are likely to represent semantically coherent phrases. 
Having found bursts of potentially newsworthy n-grams, we then group together n-grams that tend to appear in the same tweets. Each of these clusters defines a topic as a list of n-grams, which we also illustrate with a representative tweet. We call this process of finding bursty n-grams "BNgrams".</p></div>
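<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the scoring concrete, the following Python sketch computes df-idf_t scores for the n-grams of a slot, following Eq. 1. It is an illustration rather than the authors' implementation: the function names are our own, tokenization is plain whitespace splitting, and stop-word removal is omitted.</p><p>
```python
# Sketch of the BNgram burst score (Eq. 1). A "slot" is the most recent
# x messages; the previous slot is the x messages before them.
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All 1- to n_max-grams of a token list, as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def doc_freq(messages, n_max=3):
    """Document frequency: number of messages containing each n-gram."""
    df = Counter()
    for msg in messages:
        df.update(set(ngrams(msg.lower().split(), n_max)))
    return df

def df_idf_t(current_slot, previous_slot, n_max=3):
    """Score each n-gram: (df_ti + 1) / (log(df_t(i-1) + 1) + 1)."""
    df_now = doc_freq(current_slot, n_max)
    df_prev = doc_freq(previous_slot, n_max)  # Counter returns 0 if absent
    return {t: (df_now[t] + 1) / (math.log(df_prev[t] + 1) + 1)
            for t in df_now}
```
</p><p>Sorting the returned dictionary by score (descending) yields the ranked list of candidate bursty phrases; an n-gram that is frequent now but was rare in the previous slot receives a high score.</p></div>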
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topic Clustering</head><p>An isolated word or phrase is often not very informative, but a group of them can define the essence of a story. Therefore, we group the most representative phrases into clusters, each representing a single topic. A group of messages that discuss the same topic will tend to contain at least some of the same phrases. We can then find the message that contains the most phrases that define a topic, and use that message as a human-readable label for the topic. We now discuss three clustering algorithms that we compare here.</p><p>Hierarchical clustering. Here, we initially assign every n-gram to its own singleton cluster, then follow a standard "group average" hierarchical clustering algorithm <ref type="bibr" target="#b16">[17]</ref> to iteratively find and merge the closest pair of clusters. We repeat this until no two clusters share more than half their terms, at which point we assume that each cluster represents a distinct topic. We define the similarity between two terms as the fraction of messages in the same time slot that contain both of them, so clusters of terms with high pairwise similarity are likely to represent the same topic. Further details about this algorithm and its parameters can be found in our previously published work <ref type="bibr" target="#b1">[2]</ref>.</p><p>Apriori algorithm. The Apriori algorithm <ref type="bibr" target="#b17">[18]</ref> finds all the associations between the most representative n-grams based on the number of tweets in which they co-occur. Each association is a candidate topic at the end of the process. One of the advantages of this approach is that one n-gram can belong to different associations (partial membership), avoiding one problem with hierarchical clustering. The number of associations does not have to be specified in advance. 
We also obtain maximal associations after clustering to avoid large overlaps in the final set of topic clusters.</p><p>Gaussian mixture models. GMMs assign probabilities (or strengths) of membership of each n-gram to each cluster, allowing partial membership of multiple clusters. This approach does require the number of clusters to be specified in advance, although this can be automated (e.g. by using Bayesian information criteria <ref type="bibr" target="#b18">[19]</ref>). Here, we use the Expectation-Maximisation algorithm to optimise a Gaussian mixture model <ref type="bibr" target="#b19">[20]</ref>. We fix the number of clusters at 20, although initial experiments showed that using more or fewer produced very similar results. Seeking more clusters in the data than there are newsworthy topics means that some clusters will contain irrelevant tweets and outliers, which can later be assigned a low rank and effectively ignored, leaving us with a few highly-ranked clusters that are typically newsworthy.</p></div>
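<div xmlns="http://www.tei-c.org/ns/1.0"><p>The hierarchical variant can be sketched as follows. This is a simplified illustration, not the authors' code: similarity is the fraction of slot messages containing both n-grams (here approximated by substring containment rather than tokenized matching), clusters are merged greedily under a group-average linkage, and a single numeric threshold stands in for the "share more than half their terms" stopping criterion described above.</p><p>
```python
# Greedy group-average agglomerative clustering of bursty n-grams,
# using co-occurrence within the slot's messages as similarity.
def cooccurrence(term_a, term_b, messages):
    """Fraction of messages containing both terms (substring match)."""
    both = sum(1 for m in messages if term_a in m and term_b in m)
    return both / len(messages)

def avg_link(c1, c2, messages):
    """Group-average linkage between two clusters of n-grams."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(cooccurrence(a, b, messages) for a, b in pairs) / len(pairs)

def cluster_ngrams(terms, messages, threshold=0.1):
    """Merge the closest pair of clusters until no pair is similar enough."""
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        i, j = max(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]], messages),
        )
        if threshold > avg_link(clusters[i], clusters[j], messages):
            break  # closest remaining pair is still too dissimilar
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```
</p><p>Each resulting cluster is a candidate topic; the tweet containing the most of a cluster's n-grams can then serve as its human-readable label.</p></div>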
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Topic Ranking</head><p>To maximise usability we need to avoid overwhelming the user with a very large number of topics. We therefore want to rank the results by relevance. Here, we compare two topic ranking techniques.</p><p>Maximum n-gram df-idf_t. One method is to rank topics according to the maximum df-idf_t value of their constituent n-grams. The motivation for this approach is the assumption that the most popular n-gram from each topic represents the core of the topic.</p><p>Weighted topic-length. As an alternative we propose weighting the topic-length (i.e. the number of terms found in the topic) by the number of tweets in the topic to produce a score for each topic. Thus the most detailed and popular topics are assigned higher rankings. We define this score thus:</p><formula xml:id="formula_1">s_t = α · L_t / L_max + (1 − α) · N_t / N_s<label>(2)</label></formula><p>where s_t is the score of topic t, L_t is the length of the topic, L_max is the maximum number of terms in any current topic, N_t is the number of tweets in topic t and N_s is the number of tweets in the slot. Finally, α is a weighting term. Setting α to 1 rewards topics with more terms; setting α to 0 rewards topics with more tweets. We used α = 0.7 in our experiments, giving slightly more weight to those stories containing more details, although the exact value is not critical.</p></div>
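<div xmlns="http://www.tei-c.org/ns/1.0"><p>Equation 2 translates directly into code. The sketch below is illustrative (the function names and the topic representation as term-list/tweet-count pairs are our own); it scores each topic and returns them best-first.</p><p>
```python
# Weighted topic-length score (Eq. 2): combines topic detail (number of
# terms) with topic popularity (number of tweets in the slot).
def topic_score(topic_terms, topic_tweets, max_terms, slot_tweets, alpha=0.7):
    """s_t = alpha * L_t / L_max + (1 - alpha) * N_t / N_s."""
    return (alpha * len(topic_terms) / max_terms
            + (1 - alpha) * topic_tweets / slot_tweets)

def rank_topics(topics, slot_tweets, alpha=0.7):
    """topics: list of (terms, n_tweets) pairs; returns them best-first."""
    max_terms = max(len(terms) for terms, _ in topics)
    return sorted(
        topics,
        key=lambda t: topic_score(t[0], t[1], max_terms, slot_tweets, alpha),
        reverse=True,
    )
```
</p><p>With alpha = 0.7, a three-term topic mentioned in 300 of 1500 slot tweets scores 0.7 + 0.06 = 0.76, outranking a one-term topic mentioned in 900 tweets (about 0.41), reflecting the preference for more detailed stories.</p></div>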
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>Here, we show the results of our experiments with several variations of the BNgram approach. We focus on two questions. First, what is the best slot size to balance topic recall and refresh rate? A very small slot size might lead to missed stories as too few tweets would be analysed; conversely, a very large slot size means that topics would only be discovered some time after they have happened. This low 'refresh rate' would reduce the timeliness of the results. Second, what is the best combination of clustering and topic ranking techniques? Earlier, we introduced three clustering methods and two topic ranking methods; we need to determine which methods are most useful.</p><p>We have previously shown that our methods perform well <ref type="bibr" target="#b1">[2]</ref>. The BNgram approach was compared to a popular baseline system in topic detection and tracking - Latent Dirichlet Allocation (LDA) <ref type="bibr" target="#b20">[21]</ref> - and to several other competitive topic detection techniques, achieving the best overall topic recall. In addition, we have shown the benefits of using n-grams compared with single words for this sort of analysis <ref type="bibr" target="#b2">[3]</ref>. Below, we present and discuss the results from our current experiments, starting with our approach to evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation Methods</head><p>When evaluating any IR system, it is crucial to define a realistic test problem. We used three Twitter data sets focused on popular real-world events and compared the topics that our algorithm finds with an externally-defined ground truth.</p><p>Fig. <ref type="figure">1</ref>: Twitter activity during events (tweets per minute). For the FA Cup, the peaks correspond to start and end of the match and the goals. For the two political collections, the peaks correspond to the main result announcements.</p><p>To establish this ground truth, we relied on mainstream media (MSM) reports of the three events. This use of MSM sources helps to ensure that our ground truth topics are newsworthy (by definition) and that the evaluation is goal-focussed (i.e. will help journalists write such stories). We filtered Twitter using relevant keywords and hashtags to collect tweets around three events: the "Super Tuesday" primaries, part of the presidential nomination race of the US Republican Party; the 2012 FA Cup final, the climax to the English football season; and the 2012 US presidential election, an event of global significance. In each case, we reviewed the published MSM accounts of the events and chose a set of stories that were significant, time-specific, and represented on Twitter. For example, we ignored general reviews of the state of US politics (not time-specific), and quotes from members of the public (not significant events).</p><p>For each target topic, we identified around 5-7 keywords that defined the story, which we used to measure recall and precision, as discussed below. Some examples are shown in the first two columns of Table <ref type="table">4</ref>. We also defined several "forbidden" keywords. A topic was only considered successfully recalled if all of the "mandatory" terms were retrieved and none of the "forbidden" terms. 
The aim was to avoid producing topics such as "victory Romney Paul Santorum Gingrich Alaska Georgia" that convey no information about who won or where; or "Gingrich wins", which is too limited to define the story because it doesn't name the state where the victory occurred.</p><p>Figure <ref type="figure">1</ref> shows the frequency of tweets collected over time, with further details in ref. <ref type="bibr" target="#b1">[2]</ref>. We have made all the data freely available 3 . The three data sets differ in the rates of tweets, determined by the popularity of the topic and the choice of filter keywords. The mean tweets per minute (tpm) were: Super Tuesday, 832 tpm; FA Cup, 1293 tpm; and US elections, 2209 tpm. For a slot size of 1500 tweets these correspond to a "topic refresh rate" of 108s, 70s and 41s respectively. This means that a user interface displaying these topics could be updated every 1-2 minutes to show the current top-10 (or top-m) stories.</p><p>We ran the topic detection algorithm on each data set. This produced a ranked list of topics, each defined by a set of terms (i.e. n-grams). For our evaluation, we focus on the recall of the top m topics (1 ≤ m ≤ 10) at the time each ground-truth story emerges. For example, if a particular story was being discussed in the mainstream media from 10:00-10:15, then we consider the topic to be recalled if the system ranked it in the top m at any time during that period.</p><p>The automatically detected topics were compared to the ground truth (comprising 22 topics for Super Tuesday; 13 topics for FA Cup final; and 64 topics for US elections) using three metrics: Topic recall: Percentage of ground truth topics that were successfully detected by a method. A topic was considered successfully detected if the automatically produced set of words contained all mandatory keywords for it (and none of the forbidden terms, if defined). 
Keyword precision: Percentage of correctly detected keywords out of the total number of keywords for all topics detected by the algorithm in the slot. Keyword recall: Percentage of correctly detected keywords over the total number of ground truth keywords (excluding forbidden keywords) in the slot. One key difference between "topic recall" and "keyword recall" is that the former is a user-centred evaluation metric, as it considers the power of the system at retrieving and displaying to the user stories that are meaningful and coherent, as opposed to retrieving only some keywords that are potentially meaningless in isolation.</p><p>Note that we do not attempt to measure topic precision as this would need an estimate of the total number of newsworthy topics at any given time, in order to verify which (and how many) of the topics returned by our system were in fact newsworthy. This would require an exhaustive manual analysis of MSM sources to identify every possible topic (or some arbitrary subset), which is infeasible. One option is to compare detected events to some other source, such as Wikipedia, to verify the significance of the event <ref type="bibr" target="#b21">[22]</ref>, but Wikipedia does not necessarily correspond to particular journalists' requirements regarding newsworthiness and does not claim to be complete.</p></div>
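<div xmlns="http://www.tei-c.org/ns/1.0"><p>The topic-level evaluation above can be sketched in a few lines of Python. This is an illustrative reconstruction of the metric, not the authors' evaluation harness: a ground-truth topic counts as recalled only if some detected topic's keyword set contains every mandatory term and no forbidden term.</p><p>
```python
# Topic recall: fraction of ground-truth topics matched by at least one
# detected topic under the mandatory/forbidden keyword rule.
def topic_recalled(detected_terms, mandatory, forbidden=()):
    """True if all mandatory terms are present and no forbidden term is."""
    detected = set(detected_terms)
    return (set(mandatory).issubset(detected)
            and not detected.intersection(forbidden))

def topic_recall(detected_topics, ground_truth):
    """detected_topics: list of keyword sets produced by the system;
    ground_truth: list of (mandatory, forbidden) keyword pairs."""
    hits = sum(
        1 for mandatory, forbidden in ground_truth
        if any(topic_recalled(t, mandatory, forbidden) for t in detected_topics)
    )
    return hits / len(ground_truth)
```
</p><p>For example, a detected topic {"gingrich", "wins", "georgia"} recalls a ground-truth topic with mandatory keywords {gingrich, wins, georgia}, but a topic missing "georgia" would not, since it fails to name where the victory occurred.</p></div>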
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>Table <ref type="table" target="#tab_0">1</ref> shows the effect on topic recall of varying the slot size, with the same total number of topics in the evaluation for each slot size. The mean is weighted by the number of topics in the ground truth for each set, giving greater importance to larger test sets. Overall, using very few tweets produces slightly worse results than with larger slot sizes (e.g. 1500 tweets), presumably as there is too little information in such a small collection. Slightly better results for the Super Tuesday set occur with fewer tweets; this could be due to the slower tweet rate in this set. Note that previous experiments <ref type="bibr" target="#b2">[3]</ref> showed that including 3-grams improves recall compared to just using 1- and 2-grams, but adding 4-grams provides no extra benefit, so here we use 1-, 2- and 3-gram phrases throughout. Lastly, we compared the results of combining different clustering techniques with different topic ranking techniques (see Fig. <ref type="figure" target="#fig_0">2</ref>). We conclude that hierarchical clustering performs well despite the weakness discussed above (i.e. each n-gram is assigned to only one cluster), especially on the FA Cup dataset. Also, the use of the weighted topic-length ranking technique improves topic recall with hierarchical clustering in the political data sets.</p><p>The Apriori algorithm performs quite well in combination with the weighted topic-length ranking technique (note that this ranking technique was created specifically for the "partial" membership clustering techniques). 
We see that the Apriori algorithm in combination with the maximum n-gram df-idf_t ranking technique produces slightly worse results, as this ranking technique does not produce diverse topics among the top results (top 1 to top 10, in our case), as we mentioned earlier.</p><p>Turning to the EM Gaussian mixture model results, we see that this method works very well on the FA Cup final and US elections data sets. Although GMM is a "partial" membership clustering technique, the weighted topic-length ranking makes no appreciable difference here, and it even performs worse on the Super Tuesday dataset. Further work is needed to investigate this. Table <ref type="table" target="#tab_1">2</ref> summarises the results of the three clustering methods and the two ranking methods across all three data sets. The weighted-mean scores show that for the three clustering methods, ranking by the length of the topic is more effective than ranking by each topic's highest df-idf_t score. We can see that for the FA Cup set, the hierarchical and GMM clustering methods perform best in combination with the maximum n-gram df-idf_t ranking technique. For the Super Tuesday and US elections data sets, "partial" membership clustering techniques (Apriori and GMM, respectively) perform best in combination with the weighted topic-length ranking technique, as expected.</p><p>Finally, Table <ref type="table" target="#tab_2">3</ref> shows more detailed results, including keyword precision and recall, for the best combinations of clustering and topic ranking methods on the three datasets when the top five results are considered per slot. In addition, Table <ref type="table">4</ref> shows some examples of ground truth and BNgram detected topics, and tweets within the corresponding detected topics, for all datasets. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>If we compare the results between the three collections, one difference is particularly striking: the topic recall is far higher for football (over 90%) than for politics (around 60-80%; Table <ref type="table" target="#tab_1">2</ref>). This is likely to reflect the different nature of conversations about the events. Topics within a live sports event tend to be transient: fans care (or at least tweet) little about what happened five minutes ago; what matters is what is happening "now". This is especially true during key events, such as goals. In politics, conversations and comments tend to spread over hours (or even days) rather than minutes. This means that sports-related topics tend to occur over a much narrower window, with less overlapping chatter. In politics, several different topics are likely to be discussed at the same time, making this type of trend detection much harder. Looking back at the distribution of the tweets over time (Figure <ref type="figure">1</ref>), we can see clear spikes in the FA Cup graph, each corresponding to a major event (kick-off, goals, half-time, full-time etc.). No such clarity is found in the politics graphs, which are instead best viewed as many overlapping trends. This difference is reflected in the way that major news stories often emerge: an initial single, focussed story emerges but is later replaced with several potentially overlapping sub-stories covering different aspects of the story. Our results suggest that a dynamic approach may be required for newsworthy topic detection, finding an initial clear burst and subsequently seeking more subtle and overlapping topics.</p><p>Recently, Twitter has been actively increasing its ties to television<ref type="foot" target="#foot_1">4</ref>. Broadcast television and sporting events share several common features: they occur at pre-specified times; they attract large audiences; and they are fast-paced. These features all allow and encourage audience participation in the form of sharing comments and holding discussions during the events themselves, such that the focus of the discussion is constantly moving with the event itself. 
Potentially, this can allow targeted time-sensitive promotions and advertising based on topics currently receiving the most attention. Facebook and other social media are also competing for access to this potentially valuable "second screen" <ref type="bibr" target="#b22">[23]</ref>. Television shows are increasingly promoting hashtags in advance, which may make collecting relevant tweets more straightforward. Even if topic detection for news requires slightly different methods compared to sport and television, both have substantial and growing demand.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 2:</head><label>2</label><figDesc>Fig. 2: Topic recall for different clustering techniques in the Super Tuesday, FA Cup and US elections sets (slot size = 1500 tweets).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Topic recall for different slot sizes (with hierarchical clustering).</figDesc><table><row><cell cols="2">Slot size (tweets) 500 1000 1500 2000 2500</cell></row><row><cell>Super Tuesday</cell><cell>0.773 0.727 0.682 0.545 0.682</cell></row><row><cell>FA Cup</cell><cell>0.846 0.846 0.923 0.923 0.923</cell></row><row><cell>US Elections</cell><cell>0.750 0.781 0.844 0.734 0.766</cell></row><row><cell cols="2">Weighted mean 0.77 0.78 0.82 0.72 0.77</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2:</head><label>2</label><figDesc>Normalised area under the curve for the three datasets combining the different clustering and topic ranking techniques (1500 tweets per slot).</figDesc><table><row><cell>Ranking</cell><cell cols="4">Max. n-gram df − idft Weighted topic-length</cell></row><row><cell>Clustering</cell><cell cols="4">Hierar. Apriori GMM Hierar. Apriori GMM</cell></row><row><cell>FA Cup</cell><cell cols="4">0.923 0.677 0.923 0.861 0.754 0.892</cell></row><row><cell>Super Tuesday</cell><cell cols="2">0.573 0.605 0.6</cell><cell cols="2">0.591 0.614 0.586</cell></row><row><cell>US Elections</cell><cell cols="4">0.627 0.761 0.744 0.761 0.772 0.797</cell></row><row><cell cols="5">Weighted Mean 0.654 0.715 0.735 0.736 0.734 0.763</cell></row><row><cell>Method</cell><cell></cell><cell cols="3">T-REC@5 K-PREC@5 K-REC@5</cell></row><row><cell></cell><cell></cell><cell cols="2">Super Tuesday</cell></row><row><cell cols="2">Apriori+Length</cell><cell>0.682</cell><cell>0.431</cell><cell>0.68</cell></row><row><cell cols="2">GMM+Length</cell><cell>0.682</cell><cell>0.327</cell><cell>0.753</cell></row><row><cell></cell><cell></cell><cell>FA Cup</cell><cell></cell></row><row><cell cols="2">Hierar.+Max</cell><cell>0.923</cell><cell>0.337</cell><cell>0.582</cell></row><row><cell 
cols="2">Hierar.+Length</cell><cell>0.923</cell><cell>0.317</cell><cell>0.582</cell></row><row><cell>GMM+Max</cell><cell></cell><cell>0.923</cell><cell>0.267</cell><cell>0.582</cell></row><row><cell cols="2">GMM+Length</cell><cell>0.923</cell><cell>0.162</cell><cell>0.673</cell></row><row><cell></cell><cell></cell><cell cols="2">US elections</cell></row><row><cell>GMM+Max</cell><cell></cell><cell>0.844</cell><cell>0.232</cell><cell>0.571</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Best results for the different datasets after evaluating top 5 topics per slot. T-REC, K-PREC, and K-REC refers to topic-recall and keywordprecision/recall respectively.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc></figDesc><table><row><cell>Target topic</cell><cell>Ground truth keywords</cell><cell>Extracted keywords</cell><cell>Example tweet</cell></row><row><cell>Newt Gingrich says "Thank you Georgia! It is gratifying to win my home state so decisively to launch our March Momentum"</cell><cell>Newt Gingrich, Thank you, Georgia, March, Momentum, gratifying</cell><cell>launch, March, Momentum, decisively, thank, Georgia, gratifying, win, home, state, #MarchMo, #250gas, @newtgingrich</cell><cell>@Bailey Shel: RT @newtgingrich: Thank you Georgia! It is gratifying to win my home state so decisively to launch our March Momentum. #MarchMo #250gas</cell></row><row><cell>Salomon Kalou has an effort at goal from outside the area which goes wide right of the goal</cell><cell>Salomon Kalou, run, box, mazy</cell><cell>Liverpool, defence, before, gets, ambushed, Kalou, box, mazy, run, @chelseafc, great, #cfcwembley, #facup, shoot</cell><cell>@SharkbaitHooHa: RT @chelseafc: Great mazy run by Kalou into the box but he gets ambushed by the Liverpool defence before he can shoot #CFCWembley #FACup</cell></row><row><cell>US President Barack Obama has pledged "the best is yet to come", following a decisive re-election victory over Republican challenger Mitt Romney</cell><cell>Obama, best, come</cell><cell>America, best, come, United, States, hearts, #Obama, speech, know, victory</cell><cell>@northoaklandnow: "We know in our hearts that for the United States of America, the best is yet to come," says #Obama in victory speech.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://www.socialsensor.eu/results/datasets/72-twitter-tdt-dataset</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">"Twitter &amp; TV: Use the power of television to grow your impact" https://business.twitter.com/twitter-tv</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975. We wish to thank Nic Newman and Steve Schifferes of City University London for invaluable advice.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Mainstream media and the distribution of news in the age of social discovery</title>
		<author>
			<persName><forename type="first">N</forename><surname>Newman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011-09">September 2011</date>
		</imprint>
	</monogr>
	<note type="report_type">Reuters Institute for the Study of Journalism working paper</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Sensing trending topics in twitter</title>
		<author>
			<persName><forename type="first">L</forename><surname>Aiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Petkos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Skraba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jaimes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1268" to="1282" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Finding newsworthy topics on Twitter</title>
		<author>
			<persName><forename type="first">C</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Computer Society Special Technical Community on Social Networking E-Letter</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2013-09">September 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">#ukelection2010, mainstream media and the role of the internet</title>
		<author>
			<persName><forename type="first">N</forename><surname>Newman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010-07">July 2010</date>
		</imprint>
		<respStmt>
			<orgName>Reuters Institute for the Study of Journalism</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Reuters Institute for the Study of Journalism working paper</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Streaming first story detection with application to Twitter</title>
		<author>
			<persName><forename type="first">S</forename><surname>Petrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lavrenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL</title>
				<meeting>NAACL</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">10</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Using paraphrases for improving first story detection in news and Twitter</title>
		<author>
			<persName><forename type="first">S</forename><surname>Petrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lavrenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of HLT12 Human Language Technologies</title>
				<meeting>HLT12 Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="338" to="346" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Streaming trend detection in Twitter</title>
		<author>
			<persName><forename type="first">J</forename><surname>Benhardus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Artificial Intelligence, Natural Language Processing and Information Retrieval</title>
		<title level="s">National Science Foundation REU for</title>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
		<respStmt>
			<orgName>University of Colorado</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Peaks and persistence: modeling the shape of microblog conversations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Shamma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Churchill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM 2011 conference on Computer supported cooperative work</title>
				<meeting>the ACM 2011 conference on Computer supported cooperative work</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="355" to="358" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Breaking news detection and tracking in Twitter</title>
		<author>
			<persName><forename type="first">S</forename><surname>Phuvipadawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Murata</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology</title>
				<meeting>the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="120" to="123" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detecting a multi-level content similarity from microblogs based on community structures and named entities</title>
		<author>
			<persName><forename type="first">S</forename><surname>Phuvipadawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Murata</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Emerging Technologies in Web Intelligence</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="11" to="19" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Event detection and tracking in social streams</title>
		<author>
			<persName><forename type="first">H</forename><surname>Sayyadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hurst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maykov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Weblogs and Social Media (ICWSM)</title>
				<meeting>International Conference on Weblogs and Social Media (ICWSM)</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semantic expansion of hashtags for enhanced event detection in Twitter</title>
		<author>
			<persName><forename type="first">O</forename><surname>Ozdikis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Senkul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Oguztuzun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of VLDB 2012 Workshop on Online Social Systems</title>
				<meeting>VLDB 2012 Workshop on Online Social Systems</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Detecting and tracking political abuse in social media</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ratkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Conover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Meiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gonçalves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Flammini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ICWSM</title>
				<meeting>of ICWSM</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Enblogue: emergent topic detection in Web 2.0 streams</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alvanaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sebastian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ramamritham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 international conference on Management of data</title>
				<meeting>the 2011 international conference on Management of data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1271" to="1274" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Beyond trending topics: Real-world event identification on Twitter</title>
		<author>
			<persName><forename type="first">H</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Naaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gravano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM11)</title>
				<meeting>the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM11)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A statistical interpretation of term specificity and its application in retrieval</title>
		<author>
			<persName><forename type="first">K</forename><surname>Spärck Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="11" to="21" />
			<date type="published" when="1972">1972</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A survey of recent advances in hierarchical clustering algorithms</title>
		<author>
			<persName><forename type="first">F</forename><surname>Murtagh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="354" to="359" />
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Fast algorithms for mining association rules</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 20th Int. Conf. Very Large Data Bases, VLDB</title>
				<meeting>20th Int. Conf. Very Large Data Bases, VLDB</meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">1215</biblScope>
			<biblScope unit="page" from="487" to="499" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">How many clusters? Which clustering method? Answers via model-based cluster analysis</title>
		<author>
			<persName><forename type="first">C</forename><surname>Fraley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Raftery</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="578" to="588" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Maximum likelihood from incomplete data via the EM algorithm</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Dempster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Laird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">B</forename><surname>Rubin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Royal Statistical Society. Series B (Methodological)</title>
		<imprint>
			<biblScope unit="page" from="1" to="38" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Latent Dirichlet Allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003-03">Mar 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bieber no more: First story detection using Twitter and Wikipedia</title>
		<author>
			<persName><forename type="first">M</forename><surname>Osborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Petrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>McCreadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR 2012 Workshop on Time-aware Information Access</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Social networks in a battle for the second screen</title>
		<author>
			<persName><forename type="first">V</forename><surname>Goel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stelter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The New York Times</title>
		<imprint>
			<date type="published" when="2013-10-02">October 2 2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
