=Paper=
{{Paper
|id=Vol-1150/ifrim
|storemode=property
|title=Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering
|pdfUrl=https://ceur-ws.org/Vol-1150/ifrim.pdf
|volume=Vol-1150
|dblpUrl=https://dblp.org/rec/conf/www/IfrimSB14
}}
==Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering==
Georgiana Ifrim, Bichen Shi, Igor Brigadir
Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
{georgiana.ifrim,bichen.shi,igor.brigadir}@insight-centre.org

Abstract

Twitter has become as much of a news media as a social network, and much research has turned to analyzing its content for tracking real-world events, from politics to sports and natural disasters. This paper describes the techniques we employed for the SNOW Data Challenge 2014, described in [Pap14]. We show that aggressive filtering of tweets based on length and structure, combined with hierarchical clustering of tweets and ranking of the resulting clusters, achieves encouraging results. We present empirical results and discussion for two different Twitter streams focusing on the US presidential elections in 2012 and the recent events about Ukraine, Syria and Bitcoin, in February 2014.

Keywords: Event Detection, Twitter, Social Media, Digital Journalism, News Aggregation

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1 Introduction

Micro-blogging platforms such as Twitter have emerged in recent years, creating a radically new mode of communication between people. Every day, 500 million users send more than 500 million tweets (as of end 2013) [Tel13], on every possible topic. Interactions and communication in Twitter often reflect real-world events and dynamics, and important events like elections, disasters, concerts, and football games can have an immediate and direct impact on the volume of tweets posted. Because of its real-time and global nature, many people use Twitter as a primary source of news content, in addition to sharing daily life, emotions and thoughts.

Journalists also increasingly adopt social media as professional tools, and are gradually altering their processes of news selection and presentation [Jor13, Sch14]. They use Twitter to monitor the newsworthy stories that emerge from the crowd, and to find user-generated content to enrich their stories. However, it is very hard for a person to spot the useful information in Twitter without being overwhelmed by an endless stream of redundant tweets.

As a response to this problem and to the SNOW Data Challenge 2014, we propose a system to detect novel, newsworthy topics/events as they are published on Twitter. Provided with a Twitter stream that is initially filtered by a list of seed terms corresponding to known events (e.g., Ukraine) and possibly a list of user ids, the system automatically mines the social stream to provide a set of headlines and complementary information (photos and tweets) that summarize the topics for a number of time slots of interest. Although Topic Detection and Tracking [All02] has been well studied for static document corpora, in the social media context there are a few new factors that make the problem more challenging, e.g., the different language styles of Twitter and traditional news media, the fragmented and possibly ambiguous nature of tweets due to their 140-character length constraint, the high amount of noise in user-generated content, and the real-time data processing aspect.

In this paper, we present our topic detection approach: a combination of aggressive data pre-processing, hierarchical clustering of tweets, time-dependent n-gram and cluster ranking, and headline re-clustering. We analyze how factors such as event type, data pre-processing and parameters in the framework affect the quality of the topic extraction results.
The evaluation simulates a real-world application scenario, where the system works on the data of the live tweet stream and produces (close to real-time) detected topics in each user-specified time window (e.g., new headlines every 15 minutes). The selected datasets cover the US presidential elections (2012) and recent events in Ukraine and Syria (2014).

2 Related Work

Recently, [Aie13] compared several techniques for event detection in Twitter, and promoted a technique based on term clustering for obtaining trending topics. The six compared techniques in [Aie13] fit into two main categories, document clustering versus term clustering, where a cluster represents a potential topic of interest. These approaches can be further categorized into three different classes: probabilistic models (e.g., Latent Dirichlet Allocation (LDA)), classical Topic Detection and Tracking (e.g., Document-Pivot Topic Detection (FSD)) and feature-pivot methods (e.g., n-gram clustering (BNgram)).

LDA [Ble03] is a topic model that associates with each document a probability distribution over topics, which are in turn distributions over words. Every document is considered as a bag of terms, and the topic distribution per document as well as the term distribution per topic are estimated through Bayesian inference. According to the results in [Aie13], LDA models can capture stories happening during events with narrow topical scope, while their performance can be dramatically lower on noisier events.

FSD [Pet10] is designed to detect the first document discussing a topic in a large corpus via document clustering. It works with a document-term matrix, where coordinates represent the frequency of a particular term in a document, and documents are clustered by cosine similarity of their tf-idf term representation. FSD uses Locality Sensitive Hashing to rapidly retrieve the nearest neighbor of a document. Although the initial topic recall of plain FSD is not very high, it can improve significantly when employing document aggregation [Aie13] via two-stage clustering, to avoid initial topic fragmentation (i.e., the same topic being discussed by several clusters).

As proposed in [Aie13], BNgram is an n-gram feature-pivot method that clusters terms rather than documents, where the distance between terms is defined by the proportion of documents in which two terms co-occur. BNgram extracts topics in each time slot, and a time-dependent ranking is introduced to penalise topics that began in the past and are still popular in the present, via the use of a term burstiness score (df-idf_t). [Mar13] compared the performance of different types of n-grams and found that 3- and 4-grams gave similar results and were 3 times better than unigrams. BNgram has good performance on topic recall as well as keyword recall/precision; however, considering more top topics and topic/time aggregation does not improve topic recall.

Other than the six methods compared in [Aie13], a keyword-lifecycle event detection framework was recently introduced in [Mat13], in which a keyword's standard behavior is modeled by its frequency and its average daily behavior. An event is detected when a keyword's frequency is abnormal. 80% of strong earthquakes are detected by this framework, and its false positive rate is very low. The Window Variation Keyword Burst Detection [Guz13] is another recent topic detection method.

Building on recent work, we propose an approach based on tweet clustering combined with a few layers of filtering, aggregation and ranking, in order to deliver an efficient topic detection method. Our choice of tweet (vs term) clustering is based on the following observations: (1) tweet clustering methods have shown high recall, in particular when allowing a higher number of topics to be retrieved; (2) tweets are the main unit of content, lending themselves naturally to meaningful and human-readable news-like headlines, while term-clustering approaches have to deal with the challenge of re-creating a meaningful unit of content (e.g., swapping the order of terms in a cluster can change the meaning of a headline-topic); (3) we can introduce various tweet-importance metrics for re-weighting the retrieved tweet clusters, e.g., up-weighting tweets from trustworthy or high-clout sources, such as journalists.

3 Data Challenge Setup

Details of the SNOW Data Challenge can be found in [Pap14].

4 Method Proposed

The main approach behind our results for the data challenge is based on: (1) aggressive tweet and term filtering, to remove noisy tweets and vocabulary; (2) hierarchical clustering of tweets, dynamic dendrogram cutting and ranking of the resulting clusters, to obtain topics. We describe our method in detail in the following subsections. For collecting the Twitter stream we used code provided by the SNOW challenge organizers [Pap14], based on the Twitter4J API (http://twitter4j.org/en/index.html). For all other development (e.g., data pre-processing, clustering, ranking, producing final topics), we used Python 2.7 and available Python libraries. We chose Python due to the ease of development and its available range of powerful libraries (e.g., scipy, numpy, sklearn).
In particular, for tweet NLP, e.g., named entity recognition, we used a Python wrapper (the CMUTweetTagger library [Cmu14]), and for efficient hierarchical clustering of tweets we used the fastcluster library [Mue13]. Our code for topic detection is available online from https://github.com/heerme.

4.1 Data Collection

We worked with two different Twitter streams: one about the US presidential elections in 2012, collected between 6 Nov 2012, starting at 23:30, and 7 Nov 2012, ending at 6:30, and another collected starting on 25 Feb 2014, at 17:30, and ending on 26 Feb 2014, at 18:15. The first stream was collected starting from tweet ids, and had each tweet in the form of a text line containing the tweet GMT time, unix time stamp, id, user name, the text of the tweet, and whether the tweet is a retweet or not. There were 1,084,200 (252 MByte) English and non-English tweets in this stream. In order to extract the user mentions, hashtags and urls from the text of the tweet, we used the twitter-text-python library (https://github.com/ianozsvald/twitter-text-python). For the second stream, the collected data is in JSON format (see https://dev.twitter.com/docs/entities for details), meaning each line of the output stream is a tweet encoded as a JSON object. This consisted of 1,088,593 raw tweets (4.37 GByte), out of which we only used the 943,175 English tweets (3.87 GByte), filtered using the lang='en' field of the tweet JSON object. We further processed each JSON object to extract, for each tweet, only the date, tweet id, text, user mentions, hashtags, urls and media urls, to a text file for faster processing (240 MByte). For re-tweets, we replace the text of the re-tweet with the original text of the tweet that was re-tweeted (although we only do this for the tweets in JSON format, since the original tweet text is included in the JSON object). We use this text file, with one tweet per line, for all our experiments.

4.2 Data Pre-processing

An important part of our method is data pre-processing and filtering. For each tweet, we pre-process the text as follows. We normalize the text to remove urls, user mentions and hashtags, as well as digits and other punctuation. Next, we tokenize the remaining clean text by white space, and remove stop words. In order to prepare the tweet corpus, in each time window, for each tweet, we first append the user mentions, the hashtags and the resulting clean text tokens. We check the structure of the resulting tweet, and filter out tweets that have more than 2 user mentions or more than 2 hashtags, or less than 4 text tokens. The idea behind this structure-based filtering is that tweets that have many user mentions or hashtags, but lack enough clean text features, do not carry enough news-like content, or are generally very noisy. This step filters out many noisy tweets. For example, for the 15-minute time window starting on 25 Feb 2014, at 18:00, and ending at 18:15, there are 12,589 raw tweets, out of which the first filtering step (that checks the length and structure of tweets) keeps only 9,487. Our initial tweet window corpus contains the above filtered tweets.

The next step is concerned with vocabulary filtering. For each time window, from the window tweet corpus, we create a (binary) tweet-term matrix, where we remove user mentions (but keep hashtags), and the vocabulary terms are only bi-grams and tri-grams that occur in at least a minimum number of tweets: the minimum is set to 10 tweets, and the threshold grows with the window corpus length as max(int(len(window_corpus) * 0.0025), 10). This threshold does not grow very quickly; for example, for 10,000 tweets, a term should occur in at least 25 tweets to be selected in the vocabulary. The idea behind this filtering step is that clusters should gather enough tweets to be considered a topic at all (e.g., at least 25 tweets in 10,000 tweets should discuss an event). For the above example, the term filtering step reduces the vocabulary to 670 terms, therefore we now have a matrix with 9,487 tweets by 670 terms.

In the next filtering step, we reduce this matrix to only the subset of rows containing at least 5 terms (tweets with at least 5 tokens from the vocabulary). This step is meant to remove out-of-vocabulary tweets, as well as tweets that are too short to be meaningfully clustered. We varied the parameters for filtering tweets and terms, and noticed that the above chosen values were stable with regards to the topics produced. This third filtering step further reduces the original tweet-by-term matrix to 2,677 tweets and 670 terms, effectively using only 20% of the original collection of raw tweets. We have found that for Twitter streams where the language information is not available, e.g., for the 2012 US presidential elections stream, it is much faster to filter tweets and terms as above, thereby getting rid of most non-English tweets, than to apply a language identification library.

4.3 Hierarchical Clustering of Tweets

In this section we give the detailed steps of our method.

• Computing tweet pairwise distances. We compute tweet pairwise distances and a hierarchical clustering on the filtered tweet-by-term matrix. For pairwise distances we scale and normalize the tweet-term matrix, and use cosine as a metric. Our experiments showed that using euclidean distance achieved similar results. We use the sklearn and scipy python libraries for computing distances and the tweet-term matrix.
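The filtering steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's released code: the helper names and the toy stop-word list are assumptions, and the paper's own pipeline builds the tweet-term matrix with sklearn rather than by hand.

```python
# Sketch of the window filtering described in Section 4.2, assuming tweets
# arrive as (text, mentions, hashtags) tuples. Names are illustrative.
import re
from collections import Counter

STOP = {"the", "a", "is", "in", "of", "rt"}  # toy stop-word list

def clean(text):
    """Remove urls, digits and punctuation, lower-case, drop stop words."""
    text = re.sub(r"http\S+|[^a-z# ]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP]

def structure_filter(tweets):
    """Keep tweets with <= 2 mentions, <= 2 hashtags and >= 4 text tokens."""
    kept = []
    for text, mentions, hashtags in tweets:
        tokens = clean(text)
        if len(mentions) <= 2 and len(hashtags) <= 2 and len(tokens) >= 4:
            kept.append(" ".join(tokens))
    return kept

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_filter(window_corpus):
    """Binary tweet-term rows over bi/tri-grams: a term must occur in at
    least max(int(0.0025 * N), 10) tweets, and a tweet must then contain
    at least 5 vocabulary terms to survive."""
    min_df = max(int(len(window_corpus) * 0.0025), 10)
    docs = [set(ngrams(t.split(), 2)) | set(ngrams(t.split(), 3))
            for t in window_corpus]
    df = Counter(term for doc in docs for term in doc)
    vocab = {term for term, count in df.items() if count >= min_df}
    return [doc & vocab for doc in docs if len(doc & vocab) >= 5], vocab
```

Note how the document-frequency floor of 10 means that in a small window a topic candidate must be repeated by at least 10 tweets before any of its terms enter the vocabulary, which is exactly the "enough tweets to be a topic at all" argument above.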
• Computing the hierarchical clustering. For computing a hierarchical clustering, we use the fastcluster library [Mue13], which can efficiently deal with thousands of tweets/terms. The idea behind tweet clustering is that tweets belonging to the same topic will cluster together, and thus we can consider each cluster as a detected topic.

• Cutting the dendrogram. Finally, we cut the resulting dendrogram at a 0.5 distance threshold. This threshold controls how tight or loose we require our final clusters to be, without having to provide the expected number of clusters a-priori, e.g., as for k-means or other popular clustering algorithms. A higher threshold would result in looser clusters that potentially collate different topics in the same cluster. A lower threshold would result in tighter and cleaner clusters, but potentially lead to too much topic fragmentation, i.e., the same topic being reflected by lots of different clusters. We found that a value of 0.5 works well for our method.

• Ranking the resulting clusters. Once we obtain clusters with the above procedure, we assign a score to each cluster and rank them based on that score. A first attempt was to score and rank clusters by size, therefore allowing clusters with a lot of tweets to rank first as trending topics. This results in topics that tend to be more casual and are unlikely to make the news headlines (e.g., This is what happens when you put two pit bulls in a photo booth), as we show in our evaluation section. Additionally, topics tend to get repeated for several time windows, since we do not consider potential term/topic burstiness in each time window with respect to the previous time windows.

Next, we introduce term weighting, based on the frequency in the time window, as well as boosting of named entities. For the frequency-based weight, we use the df-idf_t formula from [Aie13], which discounts the term frequency in the current time window using the average frequency in the previous t time windows. The formula is shown in Equation 1:

df-idf_t = (df_i + 1) / (log((Σ_{j=1}^{t} df_{i-j}) / t + 1) + 1)    (1)

Setting the parameter t controls how much the history should affect the current weight of a term. We set t = 4 in our approach, in order to allow for hourly updates (where a time window is set to 15 minutes). Note the log in the denominator, allowing the current document frequency to have more weight than the previous/historical average frequency.

Another important focus is on tweet NLP in order to recognize named entities. We experimented with the Stanford NLP [Sta14] and the nltk pos-tagger [Bir06, Bir9], but found that they many times failed to recognize entities due to the specific language of tweets, e.g., arbitrary capitalization of words (e.g., AWESOME vs obama; many NER taggers rely on capitalization for clues to potential entities [Li12]) and short names (e.g., fb for Facebook). For this reason, we turned to the CMU Twitter NLP and Part-of-Speech Tagging tool (http://www.ark.cs.cmu.edu/TweetNLP/) for recognizing entities [Gim11]. In particular, we used a python wrapper around the CMU Java code [Cmu14]. This tool is trained on tweets and had better accuracy for named entity recognition in our tests. We apply this tool to each of the terms in our vocabulary, in order to recognize entities. Once we compute the df-idf_t and identify the entities in the vocabulary of each time window, we assign each term a weight computed as df-idf_t * entity_boost, where the entity boost was set to 2.5 in our case, versus the 1.5 used in [Aie13]. We found that a higher entity weight leads to retrieving more news-like topics. Once the term weight is computed this way, each cluster gets assigned the score of the term with the highest weight (as in [Aie13]), but we normalize this by the cluster size. This last normalization step seems to lead to less topic fragmentation, allowing smaller clusters with prominent terms to rank higher. We have also experimented with cluster scores that average the scores of the terms of a cluster. Interestingly enough, when using unigrams rather than bi-grams and tri-grams for the vocabulary, ranking clusters by averaging term scores worked better than using the maximum term score. We investigate these differences in cluster scoring in our experiments. We rank the clusters using this score, and retain only the top-20 clusters, subject to a size constraint, e.g., for a cluster to be considered a topic it should have at least 10 tweets.

We have also attempted to assign a boost to terms based on their occurrence in news articles that are streamed in a similar time window as the tweets. Nevertheless, this approach may work for some types of events, such as politics-related ones, where the news travels from the news outlets onto Twitter, but may not work for events that first break on Twitter, such as sports events, that are later reported and summarized by the news outlets. For future work we intend to analyze the connection between news articles and tweets streamed in the same time frame, and for certain types of events. Furthermore, we attempted to use deeper NLP in the first stages of our development (e.g., pos-tagging and extracting nouns and verbs), but minimal stop word removal and tweet cleaning/filtering proved to be much more efficient and equally accurate regarding topic detection. We also found, as in [Aie13], that stemming hurts the quality of the topics retrieved, so we did not apply stemming to our terms.

• Selecting topic headlines. We select the first (with respect to publication time) tweet in each cluster of the top-20 as the headline for the detected topic. This clustering/ranking strategy covers several events but many times suffers from topic fragmentation, e.g., we may get several headlines about the same topic. This issue has also been found previously in [Aie13]. Next we discuss strategies for dealing with topic fragmentation and reducing the set of topics to only the top-10.

• Re-clustering headlines to avoid topic fragmentation. Our final step involves clustering of only the headlines selected after the first stage of clustering and ranking. These are the cleaned tweets used for clustering in the first stage (no user mentions, urls, filtered vocabulary). We build a headline-by-term matrix, using unigrams for our vocabulary, without any other restriction on terms. We re-cluster the headlines using hierarchical clustering, and cut the dendrogram at the maximum distance (e.g., 1.0 for cosine). Again, setting this threshold decides how many headlines we want to collate into a single topic. We rank the resulting headline-clusters using the headline with the highest score inside a cluster; therefore, if the headlines do not cluster at all, the ranking of headlines stays the same as in the previous step.

• Final selection of topics. From this final clustering and ranking step, we select the headline with the earliest publication time, and present its raw tweet (without urls) as a final topic headline. We pool the keywords of the headlines in the same headline-cluster to extract topic-tags (a list of keywords as a description of the topic). For selecting tweet ids relevant to the extracted topic, we use the ids of the clustered headlines (i.e., the id of the tweet corresponding to the headline), and otherwise a single id, if the headline-cluster contains a single headline. The idea behind this strategy is that if the first stage of clustering did not split a topic, the tweets inside the topic-cluster were very similar to each other. For extracting urls of photos relevant to the topic, we first check if the headlines have any media url tags (as extracted from the JSON object), and if not, we loop through the (stage 1) cluster of tweets to which the headline belongs, in search of a media url in those tweets. Restricting the number of media urls to 1 or 2 directly affects the speed of the overall topic extraction process, since we don't have to dive too deep into the previous (potentially large) clusters.

5 Evaluation

To evaluate our approach, we use the subset of ground truth topics provided by the challenge organizers for the 2012 US elections stream. For the second, 2014 stream, where we were not provided with ground truth topics, we google for the automatically detected topic headline and manually assess how many of our headlines are published news in traditional media from the same time period (25-26 February 2014). We discuss our results for different choices of parameters, vocabulary and cluster scoring functions. The official evaluation results of our method in the Data Challenge are included in [Pap14].

5.1 Results

Parameter Analysis. In this section we investigate the effect of various parameters on the resulting set of topics. For setting parameters we use the subset of ground truth topics provided by the challenge organizers for the 2012 stream, a sample of which is shown in Table 1. For comparison, in Table 2, we show the top-10 topics detected by our method (with parameters set as described in the previous section) for the same stream, for the time slot starting at 07-11-2012 00:00. In Table 3, we show the top-10 topics produced by our method for the 2014 stream (parameters same as for Table 2), for the time window starting at 25-02-2014 18:00.
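Both clustering stages follow the same recipe: pairwise cosine distances, agglomerative hierarchical clustering, and a dendrogram cut at a distance threshold. The sketch below uses scipy (the paper uses the fastcluster library, which exposes a compatible `linkage` interface); the choice of average linkage here is an assumption, as the paper does not state the linkage criterion.

```python
# Sketch of the clustering step of Section 4.3: hierarchical clustering of
# binary tweet-term rows, cut at a fixed cosine-distance threshold.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_tweets(X, threshold=0.5):
    """Return one cluster label per tweet (row of X)."""
    dist = pdist(X, metric="cosine")          # condensed pairwise distances
    Z = linkage(dist, method="average")       # build the dendrogram
    return fcluster(Z, t=threshold, criterion="distance")

# Toy example: two groups of near-duplicate tweets over a 6-term vocabulary.
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
])
labels = cluster_tweets(X)  # rows 0-1 share a label, rows 2-3 share another
```

The same function covers the second stage as well: re-clustering headlines simply calls it with a unigram headline-by-term matrix and `threshold=1.0`.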
Table 1: Example ground truth topics for the 2012 US elections Twitter stream.

Time | Topic Headline | Topic Keywords | Tweet Ids
07-11-12 00:00 | Obama wins Vermont | Obama, Vermont, wins, projects, VT | 265966881926688768, 265966897793740800
07-11-12 00:00 | Romney wins Kentucky | Romney, wins, Kentucky, projects, KY | 265966833524424704, 265966921537695744
07-11-12 00:00 | Bernie Sanders wins Senate seat in Vermont | Sanders, wins, Senate, Vermont, independent, VT | 265967450074513408, 265967599123316736
07-11-12 00:00 | Romney wins Indiana | Romney, wins, Indiana, IN | 265966811449810945, 265966944522481665
07-11-12 00:30 | Romney wins West Virginia | Romney, wins, West Virginia, WV | 265974256159039488, 265974324148723712
07-11-12 00:30 | Romney wins South Carolina | Romney, wins, South Carolina, SC | 265975742649729024, 265975879736373248
07-11-12 01:00 | Obama wins Illinois | Obama, wins, Illinois, IL | 265982157355376640, 265982400880861184
07-11-12 01:00 | Obama wins Connecticut | Obama, wins, Connecticut, CT | 265982401157689345, 265982401795215360
07-11-12 01:00 | Obama wins Maine | Obama, wins, Maine, ME | 265982400880861184, 265982412897529857

Table 2: Detected top-10 topics using our method for the 2012 US elections Twitter stream (all for time slot 07-11-2012 00:00).

Topic Headline | Topic Keywords | Tweet Ids
WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election2012 | #election2012, @ap, ap, begins, breaking, calls, carolina, close, cnn, fox, georgia, indiana, kentucky, news, obama, presidential, projects, race, romney, south, vermont, washington, wins | 265967355648167937, 265967692161363969, 265967306985844736, 265967261297295361, 265967261297295361, 265967255815340032
Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont | #obama, #romney, indiana, kentucky, nbc, reporting, vermont, wins | 265967338992570368
RT @SkyNewsBreak: Sky News projection: Romney wins Kentucky. #election2012 | #election2012, @skynewsbreak, indiana, kentucky, news, obama | 265967389974343680, 265967700734533633
AP RACE CALL: Democrat Peter Shumlin wins governor race in Vermont. #Election2012 | #election2012, ap, bernie, call, democrat, governor, peter, race, sanders, seat, senate, shumlin, vermont, wins | 265968208291438592, 265967599123316736
CNN Virginia exit poll: Obama 49%, Romney 49% #election2012 | #election2012, cnn, exit, obama, poll, romney, virginia | 265967764815110146
Mitt Romney Losing in Massachusetts a state that he governed. Why vote for him when his own people don't want him? #Obama2012 | #obama2012, governed, losing, massachusetts, mitt, people, romney, state, vote, want | 265966841686544385
Twitter is gonna be live and popping when Obama wins! #Obama2012 | #obama2012, gonna, live, obama, popping, twitter, wins | 265968524072218624
INDIANA RESULTS: Romney projected winner (via @NBC) #election2012 | #dumbasses, #election2012, @huffingtonpost, @nbc, indiana, projected, results, romney, winner | 265968527289249792, 265968527289249792
If Obama wins I'm going to celebrate... If Romney wins I'm going to watch Sesame Street one last time #Obama2012 | #obama2012, celebrate, going, last, obama, one, romney, sesame, street, time, watch, wins | 265966816730435584
#election2012 important that Romney won INdependents in Virginia by 11 pts. With parties about even, winning Inds is key | #election2012, even, important, independents, inds, key, parties, pts, romney, virginia, winning, won | 265968665915191296

Table 3: Detected top-10 topics using our method for the 2014 Syria, Ukraine, Bitcoin Twitter stream (all for time window 25-02-2014 18:00).

Topic Headline | Topic Keywords | Tweet Ids | Published News?
The new, full Godzilla trailer has roared online: | awesome, brand, full, godzilla, landed, new, online, roared, trailer | 438373491440500737, 438373702573379584 | YES
At half-time Borussia Dortmund lead Zenit St Petersburg 2-0. #bbcfootball #ChampionsLeague | #bbcfootball, #championsleague, @bbcsport, borussia, dortmund, half, lead, petersburg, st, time, zenit | 438372831081279488 | YES
Ukraine Currency Hits Record Low Amid Uncertainty: Ukrainian currency, the hryvnia, hits all-time low against ... | amid, currency, hits, hryvnia, low, record, time, ukraine, ukrainian, uncertainty | 438373672412143616 | YES
Ooh, my back! Why workers' aches pains are hurting the UK economy | aches, back, economy, hurting, pains, uk, workers | 438372908814303232 | YES
Uganda: how campaigners are preparing to counter the anti-gay bill | anti, bill, campaigners, counter, gay, preparing, uganda | 438373369491505152 | YES
JPost photographer snaps what must be the most inadvertantly hilarious political picture of the decade | @jerometaylor, decade, hilarious, inadvertantly, jpost, photographer, picture, political, snaps | 438372882088226816 | YES
Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis | fans, firehouse, gather, ghostbusters, harold, nyc, outside, pay, ramis, tribute | 438375154008461313 | YES
Man survives a shooting because the Bible in his top pocket stopped two bullets | @metrouk, bible, bullets, man, pocket, shooting, stopped, survives, top, two | 438373191762059265 | YES
#Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon | #ukraine, commander, craze, even, fought, legendary, napoleon, reaches, russian, toppling | 438374254002700288, 438374829339987968 | YES
Newcastle City Hall. Impressive booking first from bottom on the left... | @robbrydon, booking, bottom, city, first, hall, impressive, left, newcastle | 438372863377408000 | NO

Tweet Length and Structure. We relax the requirement that a tweet should have length at least 5 in the final tweet-term matrix, to length at least 3 (all numbers here are for the time window of Table 2). From a window of 22,847 total tweets, this leads from an initial tweet-term matrix with 12,684 tweets and 588 terms, and a filtered tweet-term matrix with 3,258 tweets and 588 terms, to a tweet-term matrix with 3,777 tweets and 588 terms. Therefore, we get 500 extra tweets when relaxing the tweet-length constraint. The effect on the topics is nevertheless very low; we can thus keep an aggressive length filter without strongly affecting the final set of detected topics.

Unigrams vs Bi-grams/Tri-grams. We change the vocabulary to unigrams, rather than bi-grams and tri-grams, and keep all the other parameters fixed. This leads to 9,028 tweets and 482 terms (as compared to 3,258 tweets by 588 terms). This triples the number of tweets that qualify for passing the filter conditions, thus making the topic detection process less efficient. The topics detected with unigrams are fairly similar to those detected using bi-grams and tri-grams, but the use of n-grams (n > 1) allows for more efficient processing.
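The term weighting of Equation 1 and the two cluster-scoring variants compared in these experiments fit in a few lines. This is an illustrative condensation, not the paper's code: the function names are ours, and `df_history` is assumed to hold the document frequencies of the previous windows.

```python
# Sketch of Eq. 1 (df-idf_t), entity boosting, and the two cluster scores.
from math import log

def df_idf_t(df_now, df_history, t=4):
    """Eq. 1: discount the current document frequency by the average over
    the previous t windows; the log keeps the current window dominant."""
    return (df_now + 1) / (log(sum(df_history[-t:]) / t + 1) + 1)

def term_weight(df_now, df_history, is_entity, boost=2.5):
    """Boost recognized named entities (2.5 here vs 1.5 in [Aie13])."""
    return df_idf_t(df_now, df_history) * (boost if is_entity else 1.0)

def cluster_score_max(term_weights, cluster_size):
    """Paper's choice for bi/tri-grams: top term weight, normalized by
    cluster size to reduce topic fragmentation."""
    return max(term_weights) / cluster_size

def cluster_score_avg(term_weights):
    """Alternative that worked better with unigram vocabularies."""
    return sum(term_weights) / len(term_weights)
```

For a term unseen in the previous four windows (`df_history` all zero), the denominator is 1 and the weight reduces to `df_now + 1`, which is why bursty new terms dominate the ranking.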
We found that term score averaging for efficiency, since it runs in less than an hour for a full 24 computing a cluster score works better with unigrams, hour, 4GByte Twitter stream, therefore coming closer while assigning the maximum term score works better to real-time processing requirements. Strong filtering with n-grams. of tweets and terms seems to lead to efficient and clean Topic Precision. For the first stream with provided results, overcoming the heavy noise aspect of Twitter ground truth, we found that we can retrieve all the content. provided topics. In order to asses the quality of our For the future, we intend to compare our method detected topics for the second stream, where we lack to BNgrams [Aie13] and study the use of news articles ground truth, we googled for the first 100 detected top- and topic-focused streams to obtain a topic zoom-in ics (top10 of the first 10 time windows, of 15 minutes effect (e.g., topic detection on focused streams sepa- each), and evaluated how many were actually pub- rately: Ukraine vs Syria, and combining the topics in lished as news on sources other than Twitter. We the end). found that about 80% of our topics are published as news, by news media outlets (see also Table 3). Acknowledgments 5.2 Efficiency This work was supported by Science Foundation Ire- land under grant 07/CE/I1147 and SFI/12/RC/2289. The tweet clustering method presented above runs6 in around 1h for the full 24h data stream (96 time windows of 15 mins each). The most time consum- References ing parts are the tweet pairwise distance computa- [Aie13] Luca Maria Aiello, Georgios Petkos, Carlos tion and the hierarchical clustering, but we observed Martin, David Corney, Symeon Papadopoulos, that aggressive filtering of both tweets (based on struc- Ryan Skraba, Ayse Goker, Ioannis Kompat- ture/length) and terms (bi-grams and tri-grams) with siaris, and Alejandro Jaimes. 
5.2 Efficiency

The tweet clustering method presented above runs in around 1h for the full 24h data stream (96 time windows of 15 mins each), on a PC with OS X 10.9.2, 8GByte memory and a 2.7GHz Intel CPU. The most time-consuming parts are the tweet pairwise-distance computation and the hierarchical clustering, but we observed that aggressive filtering of both tweets (based on structure/length) and terms (bi-grams and tri-grams with strict thresholds on document frequency, i.e., a minimum of 10 tweets) can address the efficiency aspect.
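The quadratic growth of the pairwise-distance step is why the aggressive filtering pays off. A back-of-the-envelope comparison using the tweet counts from the unigram vs bi-/tri-gram experiment reported in Section 5.1 (an illustration of the scaling, not a measured timing):

```python
def n_pairs(n):
    """Pairwise distances needed for n tweets: n choose 2."""
    return n * (n - 1) // 2

# Tweet counts from the vocabulary comparison above:
unigram_pairs = n_pairs(9028)  # unigram vocabulary
ngram_pairs = n_pairs(3258)    # bi-/tri-gram vocabulary
print(f"{unigram_pairs:,} vs {ngram_pairs:,} distances "
      f"({unigram_pairs / ngram_pairs:.1f}x fewer with n-grams)")
```

The stricter bi-/tri-gram vocabulary cuts the number of distance computations by roughly a factor of eight, before the hierarchical clustering itself is even considered.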
6 Conclusion

We present a method for topic detection in Twitter streams, based on aggressive tweet/term filtering and two-stage hierarchical clustering: first of tweets, then of the headlines resulting from the first clustering step. The topics obtained seem encouraging, many of them being published as news in the traditional news media. Our topic headlines are actual tweets, so the user can trace the news back to its original tweet, and they are presented in the context of photos (from tweet media URLs) and tags selected from those tweets.

One potential weakness of our method is topic fragmentation, where topics get repeated across several clusters. This is most pronounced when news breaks and the same story is discussed from different points of view. We intend to investigate this further. Additionally, some headlines may get collated into a single topic: in the US 2012 elections stream, Peter Shumlin and Bernie Sanders, running for governor and Senate seats in Vermont respectively, were collated into the same topic (see Table 2, with a headline about Peter Shumlin and topic keywords about both candidates); the issue of how to choose the headline therefore remains (e.g., we could show the collated cluster headline).

A big advantage of our method is its simplicity and efficiency: it runs in less than an hour for a full 24-hour, 4GByte Twitter stream, therefore coming closer to real-time processing requirements. Strong filtering of tweets and terms seems to lead to efficient and clean results, overcoming the heavy noise of Twitter content.

For the future, we intend to compare our method to BNgrams [Aie13] and to study the use of news articles and topic-focused streams to obtain a topic zoom-in effect (e.g., topic detection on separate focused streams, Ukraine vs Syria, combining the topics in the end).

Acknowledgments

This work was supported by Science Foundation Ireland under grants 07/CE/I1147 and SFI/12/RC/2289.
References

[Aie13] Luca Maria Aiello, Georgios Petkos, Carlos Martin, David Corney, Symeon Papadopoulos, Ryan Skraba, Ayse Goker, Ioannis Kompatsiaris, and Alejandro Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 2013.

[All02] James Allan. Topic Detection and Tracking: Event-based Information Organization, volume 12. Springer, 2002.

[Bir06] Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL Interactive Presentation Sessions, 2006.

[Bir09] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.

[Ble03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[Cmu14] https://github.com/ianozsvald/ark-tweet-nlp

[Gim11] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[Guz13] Jheser Guzman and Barbara Poblete. On-line relevant anomaly detection in the Twitter stream: An efficient bursty keyword detection model. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 31–39. ACM, 2013.

[Jor13] Marenet Jordaan. Poke me, I'm a journalist: The impact of Facebook and Twitter on newsroom routines and cultures at two South African weeklies. Ecquid Novi: African Journalism Studies, 34(1):21–35, 2013.

[Li12] Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. TwiNER: Named entity recognition in targeted Twitter stream. In ACM SIGIR, 2012.

[Mar13] Carlos Martin, David Corney, and Ayse Göker. Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter, 2013.

[Mat13] Tamas Matuszka, Zoltan Vinceller, and Sandor Laki. On a keyword-lifecycle model for real-time event detection in social network data. In IEEE International Conference on Cognitive Infocommunications, 2013.

[Mue13] Daniel Muellner. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18, 2013.

[Pap14] Symeon Papadopoulos, David Corney, and Luca Maria Aiello. SNOW 2014 Data Challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.

[Pet10] Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010.

[Sch14] S. Schifferes, N. Newman, N. Thurman, D. Corney, A. S. Goker, and C. Martin. Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. Digital Journalism, 2014.

[Sta14] http://nlp.stanford.edu/software

[Tel13] http://www.telegraph.co.uk/technology/twitter/9945505/Twitter-in-numbers.html