    Event Detection in Twitter using Aggressive Filtering
            and Hierarchical Tweet Clustering

                          Georgiana Ifrim                 Bichen Shi           Igor Brigadir
                                          Insight Centre for Data Analytics
                                              University College Dublin
                                                   Dublin, Ireland
                             {georgiana.ifrim,bichen.shi,igor.brigadir}@insight-centre.org



                        Abstract

Twitter has become as much of a news medium as a social network, and much research has turned to analyzing its content for tracking real-world events, from politics to sports and natural disasters. This paper describes the techniques we employed for the SNOW Data Challenge 2014, described in [Pap14]. We show that aggressive filtering of tweets based on length and structure, combined with hierarchical clustering of tweets and ranking of the resulting clusters, achieves encouraging results. We present empirical results and discussion for two different Twitter streams, focusing on the US presidential elections in 2012 and the recent events around Ukraine, Syria and Bitcoin in February 2014.

Keywords: Event Detection, Twitter, Social Media, Digital Journalism, News Aggregation

Copyright (c) by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1    Introduction

Micro-blogging platforms such as Twitter have emerged in recent years, creating a radically new mode of communication between people. Every day, 500 million users send more than 500 million tweets (as of end 2013) [Tel13], on every possible topic. Interactions and communication in Twitter often reflect real-world events and dynamics, and important events like elections, disasters, concerts, and football games can have an immediate and direct impact on the volume of tweets posted. Because of its real-time and global nature, many people use Twitter as a primary source of news content, in addition to sharing daily life, emotions and thoughts.

Journalists also increasingly adopt social media as professional tools, and are gradually altering their processes of news selection and presentation [Jor13, Sch14]. They use Twitter to monitor the newsworthy stories that emerge from the crowd, and to find user-generated content to enrich their stories. However, it is very hard for a person to spot the useful information in Twitter without being overwhelmed by an endless stream of redundant tweets.

As a response to this problem and to the SNOW Data Challenge 2014, we propose a system to detect novel, newsworthy topics/events as they are published on Twitter. Provided with a Twitter stream that is initially filtered by a list of seed terms corresponding to known events (e.g., Ukraine) and possibly a list of user ids, the system automatically mines the social stream to provide a set of headlines and complementary information (photos and tweets) that summarize the topics for a number of time slots of interest. Although Topic Detection and Tracking [All02] has been well studied for static document corpora, in the social media context there are a few new factors that make the problem more challenging, e.g., the different language styles of Twitter and traditional news media, the fragmented and possibly ambiguous nature of tweets due to their 140 character length constraint, the high amount of noise in user-generated content, and the real-time data processing requirement.

In this paper, we present our topic detection approach: a combination of aggressive data pre-processing, hierarchical clustering of tweets, time-dependent n-gram and cluster ranking, and headline re-clustering. We analyze how factors such as event type, data pre-processing and parameters in the framework affect the quality of topic extraction results. The evaluation simulates a real-world application scenario, where the system works on the data of the live tweet stream and produces (close to real-time) detected topics in each user-specified time window (e.g., new headlines every 15 minutes). The selected datasets cover the US presidential elections (2012) and recent events in Ukraine and Syria (2014).
2    Related Work

Recently, [Aie13] has compared several techniques for event detection in Twitter, and promoted a technique based on term clustering for obtaining trending topics. The six compared techniques in [Aie13] fit into two main categories, document-clustering versus term-clustering, where a cluster represents a potential topic of interest. These approaches can be further categorized into three different classes: probabilistic models (e.g., Latent Dirichlet Allocation (LDA)), classical Topic Detection and Tracking (e.g., Document-Pivot Topic Detection (FSD)) and feature-pivot methods (e.g., n-gram clustering (BNgram)).

LDA [Ble03] is a topic model that associates with each document a probability distribution over topics, which are in turn distributions over words. Every document is considered as a bag of terms, and the topic distribution per document, as well as the term distribution per topic, are estimated through Bayesian inference. According to the results in [Aie13], LDA models can capture stories happening during events with narrow topical scope, while their performance can be dramatically lower when considering noisier events.

FSD [Pet10] is designed to detect the first document discussing a topic in a large corpus via document clustering. It works with a document-term matrix, where coordinates represent the frequency of a particular term in a document, and documents are clustered by the cosine similarity of their tf-idf term representation. FSD uses Locality Sensitive Hashing to rapidly retrieve the nearest neighbor of a document. Although the initial topic recall of plain FSD is not very high, it can improve significantly when employing document aggregation [Aie13] via two-stage clustering, to avoid initial topic fragmentation (i.e., the same topic being discussed by several clusters).

As proposed in [Aie13], BNgram is an n-gram feature-pivot method that clusters terms rather than documents, where the distance between terms is defined by the proportion of documents in which two terms co-occur. BNgram extracts topics in each time slot, and a time-dependent ranking is introduced to penalise topics that began in the past and are still popular in the present, via the use of a term burstiness score (df-idf_t). [Mar13] has compared the performance when using different types of n-grams and found that the 3- and 4-grams gave similar results and were 3 times better than using unigrams. BNgram has good performance on topic recall as well as keyword recall/precision; however, considering more top topics and topic/time aggregation does not improve topic recall.

Other than the six methods compared in [Aie13], a keyword-lifecycle event detection framework was recently introduced in [Mat13], in which a keyword's standard behavior is modeled by its frequency and its average daily behavior. An event is detected when a keyword's frequency is abnormal. 80% of strong earthquakes are detected by this framework, and its false positive rate is very low. The Window Variation Keyword Burst Detection [Guz13] is another recent topic detection method.

Building on this recent work, we propose an approach based on tweet clustering, combined with a few layers of filtering, aggregation and ranking, in order to deliver an efficient topic detection method. Our choice of tweet (vs term) clustering is based on the following observations: (1) tweet clustering methods have shown high recall, in particular when allowing a higher number of topics to be retrieved; (2) tweets are the main unit of content, lending themselves naturally to meaningful and human-readable news-like headlines, while term-clustering approaches have to deal with the challenge of re-creating a meaningful unit of content (e.g., swapping the order of terms in a cluster can change the meaning of a headline-topic); (3) we can introduce various tweet-importance metrics for re-weighting the retrieved tweet-clusters, e.g., up-weighting tweets from trustworthy or high-clout sources, such as journalists.

3    Data Challenge Setup

Details of the SNOW Data Challenge can be found in [Pap14].

4    Method Proposed

The main approach behind our results for the data challenge is based on: (1) aggressive tweet and term filtering, to remove noisy tweets and vocabulary; (2) hierarchical clustering of tweets, dynamic dendrogram cutting and ranking of the resulting clusters, to obtain topics.

We describe our method in detail in the following subsections. For collecting the Twitter stream we used code provided by the SNOW challenge organizers [Pap14], based on the Twitter4J API (http://twitter4j.org/en/index.html). For all other development (e.g., data pre-processing, clustering, ranking, producing final topics), we used Python 2.7 and available Python libraries. We chose Python due
to the ease of development and its available range of powerful libraries (e.g., scipy, numpy, sklearn). In particular, for tweet NLP, e.g., named entity recognition, we used a Python wrapper (the CMUTweetTagger library [Cmu14]), and for efficient hierarchical clustering of tweets, we used the fastcluster library [Mue13]. Our code for topic detection is available online from https://github.com/heerme.

4.1    Data Collection

We worked with two different Twitter streams: one about the US presidential elections in 2012, collected between 6 Nov 2012, starting at 23:30, and 7 Nov 2012, ending at 6:30, and another collected starting on 25 Feb 2014, at 17:30, and ending on 26 Feb 2014, at 18:15. The first stream was collected starting from tweet ids, and had each tweet in the form of a text line containing the tweet GMT time, unix time stamp, id, user name, the text of the tweet, and whether the tweet is a retweet or not. There were 1,084,200 (252MByte) English and non-English tweets in this stream. In order to extract the user mentions, hashtags and urls from the text of the tweet, we used the twitter-text-python library (https://github.com/ianozsvald/twitter-text-python). For the second stream, the collected data is in JSON format (see https://dev.twitter.com/docs/entities for details), meaning each line of the output stream is a tweet encoded as a JSON object. This consisted of 1,088,593 raw tweets (4.37GByte), out of which we only used the 943,175 English tweets (3.87GByte), filtered using the lang='en' field of the tweet JSON object. We further processed each JSON object to extract, for each tweet, only the date, tweet id, text, user mentions, hashtags, urls and media urls, to a text file for faster processing (240MByte). For re-tweets, we replace the text of the re-tweet with the original text of the tweet that was re-tweeted (although we only do this for the tweets in JSON format, since the original tweet text is included in the JSON object). We use this text file, with one tweet per line, for all our experiments.
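The following minimal sketch illustrates this collection step; it is not the code released at https://github.com/heerme, and the helper name parse_stream is illustrative. The field names (lang, retweeted_status, entities, created_at, id_str) follow the Twitter API JSON schema.

    import json

    def parse_stream(in_path, out_path):
        """Reduce a raw JSON tweet stream to one compact record per line."""
        with open(in_path) as fin, open(out_path, 'w') as fout:
            for line in fin:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue                      # skip malformed lines
                if tweet.get('lang') != 'en':     # keep only English tweets
                    continue
                if 'retweeted_status' in tweet:   # restore the original tweet text
                    tweet['text'] = tweet['retweeted_status']['text']
                ents = tweet.get('entities', {})
                record = [tweet['created_at'], tweet['id_str'], tweet['text'],
                          [u['screen_name'] for u in ents.get('user_mentions', [])],
                          [h['text'] for h in ents.get('hashtags', [])],
                          [u['expanded_url'] for u in ents.get('urls', [])],
                          [m['media_url'] for m in ents.get('media', [])]]
                fout.write(json.dumps(record) + '\n')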
                                                            by-term matrix to 2,677 tweets and 670 terms, effec-
4.2    Data Pre-processing

An important part of our method is data pre-processing and filtering. For each tweet, we pre-process the text as follows. We normalize the text to remove urls, user mentions and hashtags, as well as digits and other punctuation. Next, we tokenize the remaining clean text by white space, and remove stop words. In order to prepare the tweet corpus, in each time window, for each tweet, we first append the user mentions, the hashtags and the resulting clean text tokens. We check the structure of the resulting tweet, and filter out tweets that have more than 2 user mentions or more than 2 hashtags, or less than 4 text tokens. The idea behind this structure-based filtering is that tweets that have many user mentions or hashtags, but lack enough clean text features, do not carry enough news-like content, or are generally very noisy. This step filters out many noisy tweets. For example, for the 15 minute time window starting on 25 Feb 2014 at 18:00 and ending at 18:15, there are 12,589 raw tweets, out of which the first filtering step (that checks the length and structure of tweets) keeps only 9,487. Our initial tweet window corpus contains the above filtered tweets.

The next step is concerned with vocabulary filtering. For each time window, from the window tweet corpus, we create a (binary) tweet-term matrix, where we remove user mentions (but keep hashtags), and the vocabulary terms are only bi-grams and tri-grams that occur in at least a minimum number of tweets: the minimum is set to 10 tweets, and the threshold grows with the window corpus length, as max(int(len(window_corpus) * 0.0025), 10). This threshold does not grow very quickly; for example, for 10,000 tweets, a term should occur in at least 25 tweets to be selected into the vocabulary. The idea behind this filtering step is that clusters should gather enough tweets to be considered a topic at all (e.g., at least 25 tweets in 10,000 tweets should discuss an event). For the above example, the term filtering step reduces the vocabulary to 670 terms, therefore we now have a matrix with 9,487 tweets by 670 terms. In the next filtering step, we reduce this matrix to only the subset of rows containing at least 5 terms (tweets with at least 5 tokens from the vocabulary). This step is meant to remove out-of-vocabulary tweets, as well as tweets that are too short to be meaningfully clustered. We varied the parameters for filtering tweets and terms, and noticed that the above chosen values were stable with regards to the topics produced. This third filtering step further reduces the original tweet-by-term matrix to 2,677 tweets and 670 terms, effectively using only 20% of the original collection of raw tweets. We have found that for Twitter streams where the language information is not available, e.g., for the 2012 US presidential elections stream, it is much faster to filter tweets and terms as above, thereby getting rid of most non-English tweets, than to apply a language identification library.
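The per-window filtering just described can be condensed into the following sketch, written against modern scikit-learn (the paper itself used Python 2.7). The toy stop word list and the helper names tokens, window_corpus and tweet_term_matrix are illustrative, not the released code.

    import re
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    STOP = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'rt'}  # toy stop list

    def tokens(text):
        """Strip urls, mentions, hashtags, digits and punctuation; drop stop words."""
        text = re.sub(r'https?://\S+|@\w+|#\w+', ' ', text.lower())
        text = re.sub(r'[^a-z\s]', ' ', text)
        return [t for t in text.split() if t not in STOP]

    def window_corpus(tweets):
        """Keep tweets passing the structure filter; hashtags stay in the text."""
        corpus = []
        for tw in tweets:
            toks = tokens(tw['text'])
            # structure filter: too many mentions/hashtags, or too little clean text
            if len(tw['mentions']) > 2 or len(tw['hashtags']) > 2 or len(toks) < 4:
                continue
            corpus.append(' '.join(tw['hashtags'] + toks))
        return corpus

    def tweet_term_matrix(corpus):
        """Binary tweet-by-term matrix over bi-/tri-grams, plus the row filter."""
        min_df = max(int(len(corpus) * 0.0025), 10)    # e.g., 25 for 10,000 tweets
        vec = CountVectorizer(ngram_range=(2, 3), min_df=min_df, binary=True)
        X = vec.fit_transform(corpus)
        keep = np.asarray(X.sum(axis=1)).ravel() >= 5  # rows with >= 5 vocabulary n-grams
        return X[keep], keep, vec.get_feature_names_out()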
4.3    Hierarchical Clustering of Tweets

In this section we give the detailed steps of our method; code sketches for the main steps are given at the end of the section.

• Computing tweet pairwise distances. We compute tweet pairwise distances and a hierarchical clustering on the filtered tweet-by-term matrix. For pairwise distances we scale and normalize the tweet-term matrix, and use cosine as a metric. Our experiments showed that using euclidean distance achieves similar results. We use the sklearn and scipy Python libraries for computing the distances and the tweet-term matrix.

• Computing a hierarchical clustering. For computing a hierarchical clustering, we use the fastcluster library [Mue13], which can efficiently deal with thousands of tweets/terms. The idea behind tweet clustering is that tweets belonging to the same topic will cluster together, and thus we can consider each cluster as a detected topic.

• Cutting the dendrogram. Finally, we cut the resulting dendrogram at a 0.5 distance threshold. This threshold controls how tight or loose we require our final clusters to be, without having to provide the expected number of clusters a priori, e.g., as for k-means or other popular clustering algorithms. A higher threshold would result in looser clusters that potentially collate different topics in the same cluster. A lower threshold would result in tighter and cleaner clusters, but potentially lead to too much topic fragmentation, i.e., the same topic would be reflected by lots of different clusters. We found that a value of 0.5 works well for our method.

• Ranking the resulting clusters. Once we obtain clusters with the above procedure, we assign a score to each cluster and rank them based on that score. A first attempt was to score and rank clusters by size, therefore allowing clusters with a lot of tweets to rank first as trending topics. This results in topics that tend to be more casual and are unlikely to make the news headlines (e.g., This is what happens when you put two pit bulls in a photo booth), as we show in our evaluation section. Additionally, topics tend to get frequently repeated for several time windows, since we do not consider potential term/topic burstiness in each time window with respect to the previous time windows.

  Next, we introduce term weighting, based on the frequency in the time window, as well as boosting of named entities. For the frequency-based weight, we use the df-idf_t formula from [Aie13], which discounts the term frequency in the current time window by the average frequency in the previous t time windows. The formula is shown in Equation 1.

      df-idf_t = (df_i + 1) / (log((Σ_{j=1}^{t} df_{i-j}) / t + 1) + 1)        (1)

  Setting the parameter t controls how much the history should affect the current weight of a term. We set t = 4 in our approach, in order to allow for hourly updates (where a time window is set to 15 minutes). Note the log in the denominator, allowing the current document frequency to have more weight than the previous/historical average frequency.

  Another important focus is on tweet NLP, in order to recognize named entities. We experimented with the Stanford NLP [Sta14] and the nltk pos-tagger [Bir06, Bir09], but found that they often failed to recognize entities due to the specific language of tweets, e.g., arbitrary capitalization of words (e.g., AWESOME vs obama; many NER taggers rely on capitalization for clues on potential entities [Li12]) and short names (e.g., fb for Facebook). For this reason, we turned to the CMU Twitter NLP and Part-of-Speech Tagging tool (http://www.ark.cs.cmu.edu/TweetNLP/) for recognizing entities [Gim11]. In particular, we used a Python wrapper around the CMU Java code [Cmu14]. This tool is trained on tweets and had better accuracy for named entity recognition in our tests. We apply this tool to each of the terms in our vocabulary, in order to recognize entities. Once we compute the df-idf_t and identify the entities in the vocabulary of each time window, we assign each term a weight computed as df-idf_t * entity_boost, where the entity boost was set to 2.5 in our case, versus the 1.5 used in [Aie13]. We found that a higher entity weight leads to retrieving more news-like topics. Once the term weight is computed this way, each cluster gets assigned the score of the term with the highest weight (as in [Aie13]), but we normalize this by the cluster size. This last normalization step seems to lead to less topic fragmentation, allowing smaller clusters with prominent terms to rank higher. We have also experimented with cluster scores that average the scores of the terms of a cluster. Interestingly enough, when using unigrams rather than bi-grams and tri-grams for the vocabulary, ranking clusters by averaging term scores worked better than using the maximum term score. We investigate these differences in cluster scoring in our experiments. We rank the clusters using this score, and retain only the top-20 clusters, subject to a size constraint, e.g., for a cluster to be considered a topic it should have at least 10 tweets.

  We have also attempted to assign a boost to terms based on their occurrence in news articles that are streamed in a similar time window as the tweets. Nevertheless, this approach may work for some
types of events, such as politics-related ones, where the news travels from the news outlets onto Twitter, but may not work for events that first break on Twitter, such as sports events, which are later reported and summarized by the news outlets. For future work we intend to analyze the connection between news articles and tweets streamed in the same time frame, for certain types of events. Furthermore, we attempted to use deeper NLP in the first stages of our development (e.g., pos-tagging and extracting nouns and verbs), but minimal stop word removal and tweet cleaning/filtering proved to be much more efficient and equally accurate regarding topic detection. We also found, as in [Aie13], that stemming hurts the quality of the topics retrieved, so we did not apply stemming to our terms.

• Selecting topic headlines. We select the first (with respect to publication time) tweet in each cluster of the top-20 as the headline for the detected topic. This clustering/ranking strategy covers several events, but often suffers from topic fragmentation, e.g., we may get several headlines about the same topic. This issue has also been noted previously in [Aie13]. Below we discuss strategies for dealing with topic fragmentation and reducing the set of topics to only the top-10.

• Re-clustering headlines to avoid topic fragmentation. Our final step involves clustering of only the headlines selected after the first stage of clustering and ranking. These are the cleaned tweets used for clustering in the first stage (no user mentions or urls, filtered vocabulary). We build a headline-by-term matrix, using unigrams for our vocabulary, without any other restriction on terms. We re-cluster the headlines using hierarchical clustering, and cut the dendrogram at the maximum distance (e.g., 1.0 for cosine). Again, setting this threshold decides how many headlines we want to collate into a single topic. We rank the resulting headline-clusters using the headline with the highest score inside a cluster; therefore, if the headlines do not cluster at all, the ranking of headlines stays the same as in the previous step.

• Final selection of topics. From this final clustering and ranking step, we select the headline with the earliest publication time, and present its raw tweet (without urls) as the final topic headline. We pool the keywords of the headlines in the same headline-cluster to extract topic-tags (a list of keywords as a description of the topic). For selecting the tweet ids relevant to the extracted topic, we use the ids of the clustered headlines (i.e., the id of the tweet corresponding to the headline), or otherwise a single id, if the headline-cluster contains a single headline. The idea behind this strategy is that if the first stage of clustering did not split a topic, the tweets inside the topic-cluster were very similar to each other. For extracting urls of photos relevant to the topic, we first check if the headlines have any media url tags (as extracted from the JSON object), and if not, we loop through the (stage 1) cluster of tweets to which the headline belongs, in search of a media url in those tweets. Restricting the number of media urls to 1 or 2 directly affects the speed of the overall topic extraction process, since we don't have to dive too deep into the previous (potentially large) clusters.
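The first-stage clustering can be sketched as follows, assuming X is the filtered tweet-by-term matrix from Section 4.2. The average-linkage choice is an assumption, since the text above fixes only the fastcluster library, the cosine metric and the 0.5 cut.

    import fastcluster
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import fcluster
    from sklearn.preprocessing import normalize

    def cluster_tweets(X, cut_distance=0.5, method='average'):
        """Return one cluster label per tweet (row of X)."""
        Xn = normalize(X.toarray() if hasattr(X, 'toarray') else X)
        D = pdist(Xn, metric='cosine')              # condensed pairwise distances
        Z = fastcluster.linkage(D, method=method)   # dendrogram over all tweets
        return fcluster(Z, t=cut_distance, criterion='distance')

The second-stage headline re-clustering described above can reuse the same routine on the unigram headline-by-term matrix, with cut_distance set to the maximum distance (1.0 for cosine).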
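Similarly, the df-idf_t weighting of Equation 1 and the cluster ranking can be condensed into the sketch below. The containers dfs (current-window document frequency per term), history (per-term frequencies over the previous t windows), entities (vocabulary terms tagged as entities by the CMU tagger) and tweet_terms (vocabulary terms per tweet) are illustrative names, not the released code.

    import math
    from collections import defaultdict

    def df_idf_t(df_now, past_dfs):
        """Equation 1: discount the current df by the log-damped past average."""
        avg = sum(past_dfs) / float(len(past_dfs)) if past_dfs else 0.0
        return (df_now + 1) / (math.log(avg + 1) + 1)

    def rank_clusters(labels, tweet_terms, dfs, history, entities,
                      entity_boost=2.5, min_size=10, top_k=20):
        """Score clusters by their best term weight, normalized by cluster size."""
        weight = {}
        for term, df_now in dfs.items():
            weight[term] = df_idf_t(df_now, history.get(term, []))
            if term in entities:                    # boost named entities
                weight[term] *= entity_boost
        clusters = defaultdict(list)
        for i, label in enumerate(labels):          # group tweet indices by cluster
            clusters[label].append(i)
        scored = []
        for label, members in clusters.items():
            if len(members) < min_size:             # too small to count as a topic
                continue
            terms = set(t for i in members for t in tweet_terms[i])
            if terms:
                best = max(weight.get(t, 0.0) for t in terms)
                scored.append((best / len(members), label))
        return [label for _, label in sorted(scored, reverse=True)[:top_k]]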
5    Evaluation

To evaluate our approach, we use the subset of ground truth topics provided by the challenge organizers for the 2012 US elections stream. For the second, 2014 stream, where we were not provided with ground truth topics, we google for each automatically detected topic headline and manually assess how many of our headlines were published as news in traditional media in the same time period (25-26 February 2014). We discuss our results for different choices of parameters, vocabulary and cluster scoring functions. The official evaluation results of our method in the Data Challenge are included in [Pap14].

5.1    Results

Parameter Analysis. In this section we investigate the effect of various parameters on the resulting set of topics. For setting parameters we use the subset of ground truth topics provided by the challenge organizers for the 2012 stream, a sample of which is shown in Table 1. For comparison, in Table 2 we show the top-10 topics detected by our method (with parameters set as described in the previous section) for the same stream, for the time slot starting at 07-11-2012 00:00. In Table 3, we show the top-10 topics produced by our method for the 2014 stream (parameters the same as for Table 2), for the time window starting at 25-02-2014 18:00.
                    Table 1: Example ground truth topics for the 2012 US elections Twitter stream.
 Time                 Topic Headline                               Topic Keywords                                       Tweets Ids
 07-11-12 00:00       Obama wins Vermont                           Obama,Vermont,wins,projects,VT                       265966881926688768,265966897793740800
 07-11-12 00:00       Romney wins Kentucky                         Romney,wins,Kentucky,projects,KY                     265966833524424704,265966921537695744
 07-11-12 00:00       Bernie Sanders wins Senate                   Sanders,wins,Senate,Vermont,                         265967450074513408,265967599123316736
                      seat in Vermont                              independent,VT
 07-11-12 00:00       Romney wins Indiana                          Romney,wins,Indiana,IN                               265966811449810945,265966944522481665
 07-11-12 00:30       Romney wins West Virginia                    Romney,wins,West Virginia,WV                         265974256159039488,265974324148723712
 07-11-12 00:30       Romney wins South Carolina                   Romney,wins,South Carolina,SC                        265975742649729024,265975879736373248
 07-11-12 01:00       Obama wins Illinois                          Obama,wins,Illinois,IL                               265982157355376640,265982400880861184
 07-11-12 01:00       Obama wins Connecticut                       Obama,wins,Connecticut,CT                            265982401157689345,265982401795215360
 07-11-12 01:00       Obama wins Maine                             Obama,wins,Maine,ME                                  265982400880861184,265982412897529857


             Table 2: Detected top10 topics using our method for the 2012 US elections Twitter stream.
 Time               Topic Headline                                               Topic Keywords                                               Tweets Ids
 07-11-2012 00:00   WASHINGTON (AP) - Obama wins Vermont; Romney                 #election2012, @ap, ap, begins, breaking, calls, car-        265967355648167937,
                    wins Kentucky. #Election2012                                 olina, close, cnn, fox, georgia, indiana, kentucky, news,    265967692161363969,
                                                                                 obama, presidential, projects, race, romney, south, ver-     265967306985844736,
                                                                                 mont, washington, wins                                       265967261297295361,
                                                                                                                                              265967261297295361,
                                                                                                                                              265967255815340032
 07-11-2012 00:00   Not a shocker NBC reporting #Romney wins Indiana &           #obama, #romney, indiana, kentucky, nbc, reporting,          265967338992570368
                    Kentucky #Obama wins Vermont                                 vermont, wins
 07-11-2012 00:00   RT @SkyNewsBreak: Sky News projection: Romney wins           #election2012, @skynewsbreak, indiana, kentucky, news,       265967389974343680,
                    Kentucky. #election2012                                      obama,                                                       265967700734533633
 07-11-2012 00:00   AP RACE CALL: Democrat Peter Shumlin wins governor           #election2012, ap, bernie, call, democrat, governor, pe-     265968208291438592,
                    race in Vermont. #Election2012                               ter, race, sanders, seat, senate, shumlin, vermont, wins     265967599123316736
 07-11-2012 00:00   CNN Virginia exit poll: Obama 49%, Romney 49% #elec-         #election2012, cnn, exit, obama, poll, romney, virginia      265967764815110146
                    tion2012
 07-11-2012 00:00   Mitt Romney Losing in Massachusetts a state that he gov-     #Obama2012 #obama2012, governed, losing,              mas-   265966841686544385
                    erned. Why vote for him when his own people don’t want       sachusetts, mitt, people, romney, state, vote, want
                    him?
 07-11-2012 00:00   Twitter is gonna be live and popping when Obama wins!        #obama2012, gonna, live, obama, popping, twitter, wins       265968524072218624
                    #Obama2012
 07-11-2012 00:00   INDIANA RESULTS: Romney projected winner (via                #dumbasses, #election2012, @huffingtonpost, @nbc, in-        265968527289249792,
                    @NBC) #election2012                                          diana, projected, results, romney, winner                    265968527289249792
 07-11-2012 00:00   If Obama wins I’m going to celebrate...      If Romney       #obama2012, celebrate, going, last, obama, one, romney,      265966816730435584
                    wins I’m going to watch Sesame Street one last time          sesame, street, time, watch, wins
                    #Obama2012
 07-11-2012 00:00   #election2012 important that Romney won INdependents         #election2012, even, important, independents, inds, key,     265968665915191296
                    in Virginia by 11 pts. With parties about even, winning      parties, pts, romney, virginia, winning, won
                    Inds is key




     Table 3: Detected top10 topics using our method for the 2014 Syria, Ukraine, Bitcoin Twitter stream.
 Time               Topic Headline                                      Topic Keywords                                      Tweets Ids             Published
                                                                                                                                                   News?
 25-02-2014 18:00   The new, full Godzilla trailer has roared online:   awesome, brand, full, godzilla, landed, new, on-    438373491440500737,    YES
                                                                        line, roared, trailer                               438373702573379584
 25-02-2014 18:00   At half-time Borussia Dortmund lead Zenit           #bbcfootball, #championsleague, @bbcsport,          438372831081279488     YES
                    St Petersburg 2-0.      #bbcfootball #Champi-       borussia, dortmund, half, lead, petersburg, st,
                    onsLeague                                           time, zenit
 25-02-2014 18:00   Ukraine Currency Hits Record Low Amid Un-           amid, currency, hits, hryvnia, low, record, time,   438373672412143616     YES
                    certainty: Ukrainian currency, the hryvnia, hits    ukraine, ukrainian, uncertainty
                    all-time low against ...
 25-02-2014 18:00   Ooh, my back! Why workers’ aches pains are          aches, back, economy, hurting, pains, uk, work-     438372908814303232     YES
                    hurting the UK economy                              ers
 25-02-2014 18:00   Uganda: how campaigners are preparing to            anti, bill, campaigners, counter, gay, preparing,   438373369491505152     YES
                    counter the anti-gay bill                           uganda
 25-02-2014 18:00   JPost photographer snaps what must be the           @jerometaylor, decade, hilarious, inadvertantly,    438372882088226816     YES
                    most inadvertantly hilarious political picture of   jpost, photographer, picture, political, snaps
                    the decade
 25-02-2014 18:00   Fans gather outside Ghostbusters firehouse in       fans, firehouse, gather, ghostbusters, harold,      438375154008461313     YES
                    N.Y.C. to pay tribute to Harold Ramis               nyc, outside, pay, ramis, tribute
 25-02-2014 18:00   Man survives a shooting because the Bible in        @metrouk, bible, bullets, man, pocket, shoot-       438373191762059265     YES
                    his top pocket stopped two bullets                  ing, stopped, survives, top, two
 25-02-2014 18:00   #Ukraine’s toppling craze reaches even leg-         #ukraine, commander, craze, even, fought, leg-      438374254002700288,    YES
                    endary    Russian   commander,     who    fought    endary, napoleon, reaches, russian, toppling        438374829339987968
                    Napoleon
 25-02-2014 18:00   Newcastle City Hall. Impressive booking first       @robbrydon, booking, bottom, city, first, hall,     438372863377408000     NO
                    from bottom on the left...                          impressive, left, newcastle

Tweet Length and Structure. We relax the requirement that a tweet should have length at least 5 in the final tweet-term matrix, to length at least 3. This leads from a set of 22,847 total tweets in the window (all numbers are for the time window of Table 2), an initial tweet-term matrix of 12,684 tweets by 588 terms, and a filtered tweet-term matrix of 3,258 tweets by 588 terms, to a tweet-term matrix of 3,777 tweets by 588 terms. Therefore, we get about 500 extra tweets when relaxing the tweet-length constraint. The effect on the topics is nevertheless very low; we can thus keep an aggressive length filter without strongly affecting the final set of detected topics.

Unigrams vs Bi-grams/Tri-grams. We change the vocabulary to unigrams, rather than bi-grams and tri-grams, and keep all the other parameters fixed. This leads to 9,028 tweets and 482 terms (as compared to 3,258 tweets by 588 terms). This triples the number of tweets that qualify for passing the filter conditions, thus making the topic detection process less efficient. The topics detected with unigrams are fairly similar to those detected using bi-grams and tri-grams, but the use of n-grams (n > 1) allows for more efficient processing.
Cluster Scoring. We investigate the effect of averaging term scores for computing a cluster score, versus assigning the score of the maximum-score term in the cluster. We found that term score averaging works better with unigrams, while assigning the maximum term score works better with n-grams.

Topic Precision. For the first stream, with provided ground truth, we found that we can retrieve all the provided topics. In order to assess the quality of our detected topics for the second stream, where we lack ground truth, we googled the first 100 detected topics (top-10 of the first 10 time windows, of 15 minutes each), and evaluated how many were actually published as news on sources other than Twitter. We found that about 80% of our topics are published as news by news media outlets (see also Table 3).

5.2    Efficiency

The tweet clustering method presented above runs in around 1h for the full 24h data stream (96 time windows of 15 mins each), on a PC with OS X 10.9.2, 8GByte memory and a 2.7GHz Intel CPU. The most time consuming parts are the tweet pairwise distance computation and the hierarchical clustering, but we observed that aggressive filtering of both tweets (based on structure/length) and terms (bi-grams and tri-grams with strict thresholds on document frequency, minimum 10 tweets) can address the efficiency aspect.

6    Conclusion

We presented a method for topic detection in Twitter streams, based on aggressive tweet/term filtering and two-stage hierarchical clustering: first of tweets, and second of the headlines resulting from the first clustering step. The topics obtained seem encouraging, many of them being published as news in the traditional news media. Our topic-headlines are actual tweets, so the user can trace the news back to its original tweet, and they are presented in the context of photos (from tweet media urls) and tags selected from those tweets.

One of the potential weaknesses of our method is topic fragmentation, where topics get repeated across several clusters. This is most pronounced when news breaks and the same story is discussed from different points of view. We intend to investigate this further. Additionally, some headlines may get collated into a single topic: for the US 2012 elections stream, Peter Shumlin and Bernie Sanders, running for governor and a Senate seat in Vermont respectively, got collated into the same topic (see Table 2, the headline about Peter Shumlin and the topic keywords about both candidates); therefore the issue of how to choose the headline remains (e.g., we could show the collated cluster-headline).

A big advantage of our method is its simplicity and efficiency: it runs in less than an hour for a full 24 hour, 4GByte Twitter stream, therefore coming close to real-time processing requirements. Strong filtering of tweets and terms seems to lead to efficient and clean results, overcoming the heavy noise of Twitter content.

For the future, we intend to compare our method to BNgram [Aie13], and to study the use of news articles and topic-focused streams to obtain a topic zoom-in effect (e.g., topic detection on focused streams separately, Ukraine vs Syria, and combining the topics in the end).

Acknowledgments

This work was supported by Science Foundation Ireland under grant 07/CE/I1147 and SFI/12/RC/2289.

References

[Aie13] Luca Maria Aiello, Georgios Petkos, Carlos Martin, David Corney, Symeon Papadopoulos, Ryan Skraba, Ayse Goker, Ioannis Kompatsiaris, and Alejandro Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 2013.

[All02] James Allan. Topic detection and tracking: event-based information organization, volume 12. Springer, 2002.

[Bir06] Steven Bird. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, 2006.

[Bir09] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.

[Ble03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[Cmu14] https://github.com/ianozsvald/ark-tweet-nlp.

[Gim11] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
[Guz13] Jheser Guzman and Barbara Poblete. On-line relevant anomaly detection in the Twitter stream: an efficient bursty keyword detection model. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 31-39. ACM, 2013.

[Jor13] Marenet Jordaan. Poke me, I'm a journalist: The impact of Facebook and Twitter on newsroom routines and cultures at two South African weeklies. Ecquid Novi: African Journalism Studies, 34(1):21-35, 2013.

[Li12] Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. TwiNER: named entity recognition in targeted Twitter stream. In ACM SIGIR, 2012.

[Mar13] Carlos Martin, David Corney, and Ayse Göker. Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter, 2013.

[Mat13] Tamas Matuszka, Zoltan Vinceller, and Sandor Laki. On a keyword-lifecycle model for real-time event detection in social network data. In IEEE International Conference on Cognitive Infocommunications, 2013.

[Mue13] Daniel Muellner. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1-18, 2013.

[Pap14] Symeon Papadopoulos, David Corney, and Luca Maria Aiello. SNOW 2014 Data Challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.

[Pet10] Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010.

[Sch14] S. Schifferes, N. Newman, N. Thurman, D. Corney, A.S. Goker, and C. Martin. Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. Digital Journalism, 2014.

[Sta14] http://nlp.stanford.edu/software.

[Tel13] http://www.telegraph.co.uk/technology/twitter/9945505/Twitter-in-numbers.html.