<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgiana Ifrim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bichen Shi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Brigadir</string-name>
          <email>igor.brigadirg@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Insight Centre for Data Analytics, University College Dublin, Dublin</institution>,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
<p>Twitter has become as much a news medium as a social network, and much research has turned to analyzing its content for tracking real-world events, from politics to sports and natural disasters. This paper describes the techniques we employed for the SNOW Data Challenge 2014, described in [Pap14]. We show that aggressive filtering of tweets based on length and structure, combined with hierarchical clustering of tweets and ranking of the resulting clusters, achieves encouraging results. We present empirical results and discussion for two different Twitter streams focusing on the US presidential elections in 2012 and the recent events about Ukraine, Syria and Bitcoin, in February 2014.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Micro-blogging platforms such as Twitter have
emerged in recent years, creating a radically new mode
of communication between people. Every day, 500
million users send more than 500 million tweets (as of
end 2013) [Tel13], on every possible topic. Interactions
and communication in Twitter often reflect real-world
events and dynamics, and important events like
elections, disasters, concerts, and football games can have
immediate and direct impact on the volume of tweets
posted. Because of its real-time and global nature,
many people use Twitter as a primary source of news
content, in addition to sharing daily life, emotion and
thoughts.</p>
      <p>Journalists also increasingly adopt social media
as professional tools and are gradually altering their
processes of news selection and presentation [Jor13,
Sch14]. They use Twitter to monitor the newsworthy
stories that emerge from the crowd, and to find
user-generated content to enrich their stories. However, it
is very hard for a person to spot the useful information
in Twitter without being overwhelmed by an endless
stream of redundant tweets.</p>
      <p>As a response to this problem and the SNOW Data
Challenge 2014, we propose a system to detect novel,
newsworthy topics/events as they are published on
Twitter. Provided with a Twitter stream that is
initially filtered by a list of seed terms corresponding to
known events (e.g., Ukraine) and possibly a list of
user ids, the system automatically mines the social
stream, to provide a set of headlines and
complementary information (photo and tweets) that summarize
the topics for a number of time slots of interest.
Although Topic Detection and Tracking [All02] has been
well-studied for static document corpora, in the social
media context there are a few new factors that make
the problem more challenging, e.g., different language
styles between Twitter and traditional news media, the
fragmented and possibly ambiguous nature of tweets
due to their 140 character length constraint, the high
amount of noise in the user-generated content and the
real-time data processing aspect.</p>
<p>In this paper, we present our topic detection approach: a combination of aggressive data pre-processing, hierarchical clustering of tweets, time-dependent n-gram and cluster ranking, and headline re-clustering. We analyze how factors such as event type, data pre-processing and parameters in the framework affect the quality of topic extraction results. The evaluation simulates a real-world application scenario, where the system works on the data of the live tweet stream and produces (close to real-time) detected topics in each user-specified time window (e.g., new headlines every 15 minutes). The selected datasets cover the US presidential elections (2012) and recent events in Ukraine and Syria (2014).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
<p>Recently, [Aie13] compared several techniques for event detection in Twitter, and promoted a technique based on term clustering for obtaining trending topics. The six compared techniques in [Aie13] fit into two main categories, document clustering versus term clustering, where a cluster represents a potential topic of interest. These approaches can be further categorized into three different classes: probabilistic models (e.g., Latent Dirichlet Allocation (LDA)), classical Topic Detection and Tracking (e.g., Document-Pivot Topic Detection (FSD)) and feature-pivot methods (e.g., n-gram clustering (BNgram)).</p>
      <p>LDA [Ble03] is a topic model that associates with
each document a probability distribution over topics,
which are in turn distributions over words. Every
document is considered as a bag of terms, and the topic
distribution per document as well as the term
distribution per topic are estimated through Bayesian
inference. According to results in [Aie13], LDA models can
capture stories happening during events with narrow
topical scope, while their performance can be
dramatically lower when considering more noisy events.</p>
<p>FSD [Pet10] is designed to detect the first document discussing a topic in a large corpus via document clustering. It works with a document-term matrix, where coordinates represent the frequency of a particular term in a document, and documents are clustered by the cosine similarity of their tf-idf term representation. FSD uses Locality Sensitive Hashing to rapidly retrieve the nearest neighbor of a document. Although the initial topic recall of plain FSD is not very high, it can significantly improve when employing document aggregation [Aie13] via two-stage clustering, to avoid initial topic fragmentation (i.e., the same topic being discussed by several clusters).</p>
<p>As proposed in [Aie13], BNgram is an n-gram feature-pivot method that clusters terms rather than documents, where the distance between terms is defined by the proportion of documents in which two terms co-occur. BNgram extracts topics in each time slot, and a time-dependent ranking is introduced to penalise topics that began in the past and are still popular in the present, via the use of a term burstiness score (df-idf_t). [Mar13] compared the performance when using different types of n-grams and found that the 3- and 4-grams gave similar results and were 3 times better than using unigrams. BNgram has good performance on topic recall as well as keyword recall/precision; however, considering more top topics and topic/time aggregation does not improve topic recall.</p>
    </sec>
    <sec id="sec-3">
      <title>Method Proposed</title>
      <p>The main approach behind our results for the data
challenge is based on: (1) Aggressive tweet and term
filtering, to remove noisy tweets and vocabulary; (2)
Hierarchical clustering of tweets, dynamic dendrogram
cutting and ranking of the resulting clusters, to obtain
topics.</p>
<p>We describe our method in detail in the following subsections. For collecting the Twitter stream we used code provided by the SNOW challenge organizers [Pap14], based on the Twitter4J API (http://twitter4j.org/en/index.html). For all other development (e.g., data pre-processing, clustering, ranking, producing final topics), we used Python 2.7 and available Python libraries, chosen for the ease of development and the range of powerful libraries available (e.g., scipy, numpy, sklearn).</p>
      <p>Other than the six methods compared in [Aie13], a
keyword-lifecycle event detection framework was
recently introduced in [Mat13], in which a keyword's
standard behavior is modeled by its frequency and its
average daily behavior. An event is detected when a
keyword's frequency is abnormal. 80% of the strong
earthquakes are detected by this framework, and its
false positive rate is very low. The Window Variation
Keyword Burst Detection [Guz13] is another recent
topic detection method.</p>
      <p>Building on recent work, we propose an approach
based on tweet-clustering combined with a few layers
of filtering, aggregation and ranking in order to
deliver an efficient topic detection method. Our choice
of tweet (vs term) clustering is based on the following
observations: (1) tweet clustering methods have shown
high recall, in particular when allowing a higher
number of topics to be retrieved; (2) tweets are the main
unit of content, lending themselves naturally to
meaningful and human-readable news-like headlines, while
term-clustering approaches have to deal with the
challenge of re-creating a meaningful unit of content (e.g.,
swapping the order of terms in a cluster can change
the meaning of a headline-topic); (3) we can introduce
various tweet-importance metrics for re-weighting the
retrieved tweet-clusters, e.g., up-weighting tweets from
trustworthy or high clout sources, such as journalists.</p>
    </sec>
    <sec id="sec-4">
<title>Data Challenge Setup</title>
<p>Details of the SNOW Data Challenge can be found in [Pap14]. For tweet NLP (e.g., named entity recognition) we used a Python wrapper around the CMU tagger (the CMUTweetTagger library [Cmu14]), and for efficient hierarchical clustering of tweets the fastcluster library [Mue13]. Our code for topic detection is available online at https://github.com/heerme.</p>
      <sec id="sec-4-1">
        <title>Data Collection</title>
<p>We worked with two different Twitter streams: one about the US presidential elections in 2012, collected between 6 Nov 2012 at 23:30 and 7 Nov 2012 at 6:30, and another collected between 25 Feb 2014 at 17:30 and 26 Feb 2014 at 18:15. The first stream was collected starting from tweet ids, and had each tweet in the form of a text line containing the tweet GMT time, unix time stamp, id, user name, the text of the tweet, and whether the tweet is a retweet or not. There were 1,084,200 English and non-English tweets in this stream (252MByte). In order to extract the user mentions, hashtags and urls from the text of the tweet, we used the twitter-text-python library (https://github.com/ianozsvald/twitter-text-python). For the second stream, the collected data is in JSON format (see https://dev.twitter.com/docs/entities for details), meaning each line of the output stream is a tweet encoded as a JSON object. This consisted of 1,088,593 raw tweets (4.37GByte), out of which we only used the 943,175 English tweets (3.87GByte), filtered using the lang='en' field of the tweet JSON object. We further processed each JSON object to extract, for each tweet, only the date, tweet id, text, user mentions, hashtags, urls and media urls, to a text file for faster processing (240MByte). For re-tweets, we replace the text of the re-tweet with the original text of the tweet that was re-tweeted (we only do this for the tweets in JSON format, since the original tweet text is included in the JSON object). We use this text file, with one tweet per line, for all our experiments.</p>
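<p>The per-tweet extraction step described above can be sketched as follows. This is an illustrative reconstruction rather than the challenge code: the field names follow the standard Twitter JSON schema, and the helper name is ours.</p>
<preformat>
```python
import json

def extract_tweet(line):
    """Reduce one JSON-encoded tweet to the fields used downstream.

    Returns None for non-English tweets (lang field filter). For
    re-tweets, the original tweet's text is substituted, since the
    original text is available in the retweeted_status object.
    """
    tweet = json.loads(line)
    if tweet.get("lang") != "en":
        return None
    if "retweeted_status" in tweet:
        text = tweet["retweeted_status"]["text"]
    else:
        text = tweet["text"]
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id_str"],
        "date": tweet["created_at"],
        "text": text,
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "media_urls": [m["media_url"] for m in entities.get("media", [])],
    }
```
</preformat>
<p>Writing the extracted fields out as one line per tweet yields the compact text file used in all subsequent steps.</p>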
      </sec>
      <sec id="sec-4-2">
        <title>Data Pre-processing</title>
<p>An important part of our method is data pre-processing and filtering. For each tweet, we pre-process the text as follows. We normalize the text to remove urls, user mentions and hashtags, as well as digits and other punctuation. Next, we tokenize the remaining clean text by white space, and remove stop words. In order to prepare the tweet corpus, in each time window, for each tweet, we first append the user mentions, the hashtags and the resulting clean text tokens. We check the structure of the resulting tweet, and filter out tweets that have more than 2 user mentions or more than 2 hashtags, or fewer than 4 text tokens. The idea behind this structure-based filtering is that tweets that have many user mentions or hashtags, but lack enough clean text features, do not carry enough news-like content, or are generally very noisy. This step filters out many noisy tweets. For example, for the 15-minute time window starting on 25 Feb 2014 at 18:00 and ending at 18:15, there are 12,589 raw tweets, out of which the first filtering step (that checks the length and structure of tweets) keeps only 9,487. Our initial tweet window corpus contains the above filtered tweets.</p>
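<p>A minimal sketch of the normalization and structure-based filter, under the thresholds stated above (at most 2 mentions, at most 2 hashtags, at least 4 clean text tokens); the tokenizer, the stop word list and the function name are simplifying assumptions, not the paper's exact code.</p>
<preformat>
```python
import re

# illustrative stop word list; the paper's exact list is not given
STOP_WORDS = {"the", "a", "an", "is", "rt", "to", "of", "and", "in", "on"}

def clean_and_filter(text, max_mentions=2, max_hashtags=2, min_tokens=4):
    """Normalize one tweet and apply the structure-based filter.

    Returns the combined list of user mentions, hashtags and clean
    text tokens, or None when the tweet fails the structure check.
    """
    mentions = re.findall(r"@\w+", text)
    hashtags = re.findall(r"#\w+", text)
    # strip urls, mentions and hashtags, then digits and punctuation
    clean = re.sub(r"http\S+|@\w+|#\w+", " ", text)
    clean = re.sub(r"[^A-Za-z\s]", " ", clean)
    tokens = [t for t in clean.lower().split() if t not in STOP_WORDS]
    if len(mentions) > max_mentions or len(hashtags) > max_hashtags:
        return None
    if len(tokens) >= min_tokens:
        return mentions + hashtags + tokens
    return None
```
</preformat>
<p>Tweets for which the function returns None are dropped; the surviving token lists (mentions + hashtags + clean tokens) form the window corpus fed to the vocabulary filtering step.</p>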
<p>The next step is concerned with vocabulary filtering. For each time window, from the window tweet corpus, we create a (binary) tweet-term matrix, where we remove user mentions (but keep hashtags), and the vocabulary terms are only bi-grams and tri-grams that occur in at least a minimum number of tweets: the minimum is set to 10 tweets, and the threshold grows with the window corpus size, as max(int(len(window_corpus) * 0.0025), 10). This threshold does not grow very quickly; for example, for 10,000 tweets, a term should occur in at least 25 tweets to be selected into the vocabulary. The idea behind this filtering step is that clusters should gather enough tweets to be considered a topic at all (e.g., at least 25 tweets in 10,000 should discuss an event). For the above example, the term filtering step reduces the vocabulary to 670 terms, therefore we now have a matrix with 9,487 tweets by 670 terms. In the next filtering step, we reduce this matrix to only the subset of rows containing at least 5 terms (tweets with at least 5 tokens from the vocabulary). This step is meant to remove out-of-vocabulary tweets, as well as tweets that are too short to be meaningfully clustered. We varied the parameters for filtering tweets and terms, and noticed that the chosen values above were stable with regard to the topics produced. This third filtering step further reduces the original tweet-by-term matrix to 2,677 tweets and 670 terms, effectively using only 20% of the original collection of raw tweets. We have found that for Twitter streams where the language information is not available, e.g., for the 2012 US presidential elections stream, it is much faster to filter tweets and terms as above, thereby getting rid of most non-English tweets, than to apply a language identification library.</p>
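<p>The vocabulary filter can be sketched as a document-frequency count over bi- and tri-grams with the adaptive floor described above; this is an illustrative re-implementation (in practice the same matrix can be built with standard Python libraries such as sklearn).</p>
<preformat>
```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(tweet_tokens, floor=10, rate=0.0025):
    """Vocabulary of bi- and tri-grams that occur in at least
    max(int(rate * corpus_size), floor) tweets, per the thresholds
    described above. tweet_tokens is a list of token lists, one per
    filtered tweet in the current time window.
    """
    min_df = max(int(len(tweet_tokens) * rate), floor)
    df = Counter()
    for tokens in tweet_tokens:
        # binary document frequency: count each n-gram once per tweet
        df.update(set(ngrams(tokens, 2) + ngrams(tokens, 3)))
    return {gram for gram, count in df.items() if count >= min_df}
```
</preformat>
<p>For a 10,000-tweet window the floor works out to max(25, 10) = 25 tweets, matching the example above.</p>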
      </sec>
      <sec id="sec-4-3">
        <title>Hierarchical Clustering of Tweets</title>
        <p>In this section we give the detailed steps for our
method.</p>
<p>Computing tweet pairwise distances. We compute tweet pairwise distances and a hierarchical clustering on the filtered tweet-by-term matrix. For pairwise distances we scale and normalize the tweet-term matrix, and use cosine as a metric. Our experiments showed that using euclidean distance achieved similar results. We use the sklearn and scipy Python libraries for computing the distances and the tweet-term matrix.</p>
<p>Computing the hierarchical clustering. For computing a hierarchical clustering, we use the fastcluster library [Mue13], which can efficiently deal with thousands of tweets/terms. The idea behind tweet clustering is that tweets belonging to the same topic will cluster together, and thus we can consider each cluster a detected topic.</p>
<p>Cutting the dendrogram. Finally, we cut the resulting dendrogram at a 0.5 distance threshold. This threshold controls how tight or loose we require our final clusters to be, without having to provide the expected number of clusters a priori, as for k-means and other popular clustering algorithms. A higher threshold results in looser clusters that potentially collate different topics in the same cluster. A lower threshold results in tighter and cleaner clusters, but potentially leads to too much topic fragmentation, i.e., the same topic being reflected by many different clusters. We found that a value of 0.5 works well for our method.</p>
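<p>The three steps above can be sketched as follows; scipy's linkage is shown here as a stand-in for the fastcluster library (the two expose compatible interfaces), and average linkage is an assumption, since the paper does not state which linkage criterion was used.</p>
<preformat>
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_tweets(X, threshold=0.5):
    """Hierarchical clustering of the filtered tweet-term matrix X
    (dense array, one row per tweet), cut at a fixed cosine-distance
    threshold as described above.
    """
    dists = pdist(X, metric="cosine")
    # guard against tiny negative values from floating point noise
    dists = np.clip(dists, 0.0, None)
    Z = linkage(dists, method="average")
    # flat cluster labels: tweets closer than the threshold share a label
    return fcluster(Z, t=threshold, criterion="distance")
```
</preformat>
<p>Each distinct label in the returned array corresponds to one candidate topic cluster.</p>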
<p>Ranking the resulting clusters. Once we obtain clusters with the above procedure, we assign a score to each cluster and rank the clusters based on that score. A first attempt was to score and rank clusters by size, allowing clusters with many tweets to rank first as trending topics. This results in topics that tend to be more casual and are unlikely to make the news headlines (e.g., This is what happens when you put two pit bulls in a photo booth), as we show in our evaluation section. Additionally, topics tend to get repeated for several time windows, since we do not consider potential term/topic burstiness in each time window with respect to the previous time windows.</p>
<p>Next, we introduce term weighting based on the frequency in the time window, as well as boosting of named entities. For the frequency-based weight, we use the df-idf_t formula from [Aie13], which discounts the term frequency in the current time window using the average frequency in the previous t time windows. The formula is shown in Equation 1.</p>
<p>df-idf_t = (df_i + 1) / (log((sum_{j=1..t} df_{i-j}) / t + 1) + 1)   (1)</p>
<p>Setting the parameter t controls how much the history should affect the current weight of a term. We set t = 4 in our approach, in order to allow for hourly updates (where a time window is set to 15 minutes). Note the log in the denominator, allowing the current document frequency to have more weight than the previous/historical average frequency.</p>
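<p>Equation 1 translates directly into code; the function name and argument layout are ours (df_history holds the term's document frequencies in the previous t windows).</p>
<preformat>
```python
import math

def df_idf_t(df_current, df_history):
    """Time-dependent term weight from Equation 1: the current
    document frequency of a term, discounted by the average of its
    document frequencies over the previous t windows (t is the
    length of df_history; the paper sets t = 4).
    """
    t = len(df_history)
    avg_hist = sum(df_history) / t
    return (df_current + 1) / (math.log(avg_hist + 1) + 1)
```
</preformat>
<p>With an all-zero history the denominator reduces to 1, so a newly bursting term keeps its full current frequency, while a term that was equally frequent in the past is discounted logarithmically.</p>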
<p>Another important focus is on tweet NLP in order to recognize named entities. We experimented with the Stanford NLP [Sta14] and the nltk pos-tagger [Bir06, Bir9], but found that they often failed to recognize entities due to the specific language of tweets, e.g., arbitrary capitalization of words (e.g., AWESOME vs obama; many NER taggers rely on capitalization for clues on potential entities [Li12]) and short names (e.g., fb for Facebook). For this reason, we turned to the CMU Twitter NLP and Part-of-Speech Tagging tool (http://www.ark.cs.cmu.edu/TweetNLP/) for recognizing entities [Gim11]. In particular, we used a Python wrapper around the CMU Java code [Cmu14]. This tool is trained on tweets and had better accuracy for named entity recognition in our tests. We apply this tool to each of the terms in our vocabulary, in order to recognize entities. Once we compute the df-idf_t and identify the entities in the vocabulary of each time window, we assign each term a weight computed as df-idf_t * entity_boost, where the entity boost was set to 2.5 in our case, versus the 1.5 used in [Aie13]. We found that a higher entity weight leads to retrieving more news-like topics. Once the term weight is computed this way, each cluster gets assigned the score of the term with the highest weight (as in [Aie13]), but we normalize this by the cluster size. This last normalization step seems to lead to less topic fragmentation, allowing smaller clusters with prominent terms to rank higher. We have also experimented with cluster scores that average the scores of the terms in a cluster. Interestingly enough, when using unigrams rather than bi-grams and tri-grams for the vocabulary, ranking clusters by averaging term scores worked better than using the maximum term score. We investigate these differences in cluster scoring in our experiments. We rank the clusters using this score, and retain only the top-20 clusters, subject to a size constraint, e.g., for a cluster to be considered a topic it should have at least 10 tweets.</p>
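<p>The term weighting and cluster scoring just described can be sketched as below; the entity boost of 2.5 and the normalization by cluster size follow the text, while the function signature is an assumption of ours.</p>
<preformat>
```python
def score_cluster(cluster_term_weights, cluster_size, entity_terms,
                  entity_boost=2.5):
    """Score a tweet cluster as described above: each term's df-idf_t
    weight is multiplied by an entity boost (2.5 here, vs 1.5 in
    [Aie13]) if the term was tagged as containing a named entity;
    the cluster takes the maximum boosted term weight, normalized
    by the cluster size. cluster_term_weights maps each vocabulary
    term in the cluster to its df-idf_t weight.
    """
    boosted = {
        term: weight * (entity_boost if term in entity_terms else 1.0)
        for term, weight in cluster_term_weights.items()
    }
    return max(boosted.values()) / cluster_size
```
</preformat>
<p>Averaging the boosted weights instead of taking the maximum is the alternative scoring discussed above for unigram vocabularies.</p>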
<p>We have also attempted to assign a boost to terms based on their occurrence in news articles that are streamed in a similar time window as the tweets. Nevertheless, this approach may work for some types of events, such as politics-related ones, where the news travels from the news outlets onto Twitter, but may not work for events that first break on Twitter, such as sports events, which are later reported and summarized by the news outlets. For future work we intend to analyze the connection between news articles and tweets streamed in the same time frame, for certain types of events. Furthermore, we attempted to use deeper NLP in the first stages of our development (e.g., pos-tagging and extracting nouns and verbs), but minimal stop word removal and tweet cleaning/filtering proved to be much more efficient and equally accurate regarding topic detection. We also found, as in [Aie13], that stemming hurts the quality of topics retrieved, so we did not apply stemming to our terms.</p>
<p>Selecting topic headlines. We select the first (with respect to publication time) tweet in each cluster of the top-20 as the headline for the detected topic. This clustering/ranking strategy covers several events but often suffers from topic fragmentation, e.g., we may get several headlines about the same topic. This issue has also been observed previously in [Aie13]. Next we discuss strategies for dealing with topic fragmentation and reducing the set of topics to only the top-10.</p>
<p>Re-clustering headlines to avoid topic fragmentation. Our final step involves clustering of only the headlines selected after the first stage of clustering and ranking. These are the cleaned tweets used for clustering in the first stage (no user mentions, no urls, filtered vocabulary). We build a headline-by-term matrix, using unigrams for our vocabulary, without any other restriction on terms. We re-cluster the headlines using hierarchical clustering, and cut the dendrogram at the maximum distance (e.g., 1.0 for cosine). Again, setting this threshold decides how many headlines we want to collate into a single topic. We rank the resulting headline-clusters using the headline with the highest score inside each cluster; therefore, if the headlines do not cluster at all, the ranking of headlines stays the same as in the previous step.</p>
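<p>A sketch of the headline re-clustering and score propagation, assuming the headline-by-unigram matrix and the per-headline scores from the previous step are already available; scipy is again used as a stand-in for fastcluster, and average linkage remains an assumption.</p>
<preformat>
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def recluster_headlines(headline_matrix, headline_scores, threshold=1.0):
    """Second-stage clustering over the top-ranked headlines.

    headline_matrix is a binary unigram matrix with one row per
    headline; each headline cluster inherits the best score among
    its members, so singleton headlines keep their original rank.
    """
    dists = np.clip(pdist(headline_matrix, metric="cosine"), 0.0, None)
    labels = fcluster(linkage(dists, method="average"),
                      t=threshold, criterion="distance")
    best = {}
    for label, score in zip(labels, headline_scores):
        best[label] = max(best.get(label, 0.0), score)
    # headline clusters ranked by their best member score, highest first
    return sorted(best.items(), key=lambda kv: -kv[1])
```
</preformat>
<p>Collated headlines then share one topic, whose tags are pooled from all member headlines as described next.</p>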
<p>Final selection of topics. From this final clustering and ranking step, we select the headline with the earliest publication time, and present its raw tweet (without urls) as a final topic headline. We pool the keywords of the headlines in the same headline-cluster to extract topic-tags (a list of keywords serving as a description of the topic). For selecting tweet ids relevant to the extracted topic, we use the ids of the clustered headlines (i.e., the id of the tweet corresponding to each headline), or otherwise a single id, if the headline-cluster contains a single headline. The idea behind this strategy is that if the first stage of clustering did not split a topic, the tweets inside the topic-cluster were very similar to each other. For extracting urls of photos relevant to the topic, we first check if the headlines have any media url tags (as extracted from the JSON object), and if not, we loop through the (stage 1) cluster of tweets to which the headline belongs, in search of a media url in those tweets. Restricting the number of media urls to 1 or 2 directly affects the speed of the overall topic extraction process, since we do not have to dive too deep into the previous (potentially large) clusters.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
<p>To evaluate our approach, we use the subset of ground truth topics provided by the challenge organizers for the 2012 US elections stream. For the second, 2014 stream, where we were not provided with ground truth topics, we google for each automatically detected topic headline and manually assess how many of our headlines were published as news in traditional media in the same time period (25-26 February 2014). We discuss our results for different choices of parameters, vocabulary and cluster scoring functions. The official evaluation results of our method in the Data Challenge are included in [Pap14].</p>
      <sec id="sec-5-1">
        <title>Results</title>
<p>Parameter Analysis. In this section we investigate the effect of various parameters on the resulting set of topics. For setting parameters we use the subset of ground truth topics provided by the challenge organizers for the 2012 stream, a sample of which is shown in Table 1. For comparison, in Table 2, we show the top-10 topics detected by our method (with parameters set as described in the previous section) for the same stream, for the time slot starting at 07-11-2012 00:00. In Table 3, we show the top-10 topics produced by our method for the 2014 stream (same parameters as for Table 2), for the time window starting at 25-02-2014 18:00.</p>
<p>Tweet Length and Structure. We relax the requirement that a tweet should be of length at least 5 in the final tweet-term matrix, to length at least 3. This leads from a set of 22,847 total tweets in the window (all numbers are for the time window of Table 2), an initial tweet-term matrix with 12,684 tweets and 588 terms, and a filtered tweet-term matrix with 3,258 tweets and 588 terms, to a tweet-term matrix with 3,777 tweets and 588 terms. Therefore, we get about 500 extra tweets when relaxing the tweet-length constraint. The effect on the topics is nevertheless very low; we can thus keep an aggressive length filter without strongly affecting the final set of detected topics.</p>
<p>Unigrams vs Bi-grams/Tri-grams. We change the vocabulary to unigrams, rather than bi-grams and tri-grams, and keep all the other parameters fixed. This leads to 9,028 tweets and 482 terms (as compared to 3,258 tweets by 588 terms). This triples the number of tweets that qualify for passing the filter conditions, thus making the topic detection process less efficient. The topics detected with unigrams are fairly similar to those detected using bi-grams and tri-grams, but the use of n-grams (n &gt; 1) allows for more efficient processing.</p>
<p>Cluster Scoring. We investigate the effect of averaging term scores for computing a cluster score versus assigning the score of the maximum-scoring term in the cluster. We found that term score averaging works better with unigrams, while assigning the maximum term score works better with n-grams.</p>
<p>Topic Precision. For the first stream, with provided ground truth, we found that we can retrieve all the provided topics. In order to assess the quality of our detected topics for the second stream, where we lack ground truth, we googled the first 100 detected topics (top-10 of the first 10 time windows, of 15 minutes each), and evaluated how many were actually published as news on sources other than Twitter. We found that about 80% of our topics were published as news by news media outlets (see also Table 3).</p>
<p>Efficiency. The tweet clustering method presented above runs in around 1 hour for the full 24-hour data stream (96 time windows of 15 minutes each). The most time-consuming parts are the tweet pairwise distance computation and the hierarchical clustering, but we observed that aggressive filtering of both tweets (based on structure/length) and terms (bi-grams and tri-grams, with strict thresholds on document frequency, minimum 10 tweets) addresses the efficiency aspect.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>We present a method for topic detection in Twitter streams, based on aggressive tweet/term filtering and two-stage hierarchical clustering, first of tweets and second of the resulting headlines from the first clustering step. The topics obtained seem encouraging, many of them being published as news in the traditional news media. Our topic-headlines are actual tweets, so the user can trace the news back to its original tweet, and they are presented in the context of photos (from tweet media urls) and tags selected from those tweets.</p>
<p>One of the potential weaknesses of our method is topic fragmentation, where topics get repeated across several clusters. This is most pronounced when news breaks and the same story is discussed from different points of view. We intend to investigate this further. Additionally, some headlines may get collated into a single topic: for the US 2012 elections stream, Peter Shumlin and Bernie Sanders, running for governor and Senate seats in Vermont respectively, got collated into the same topic (see Table 2, headline about Peter Shumlin and topic keywords about both candidates); therefore the issue of how to choose the headline remains (e.g., we could show the collated cluster-headline).</p>
<p>A big advantage of our method is its simplicity and efficiency, since it runs in less than an hour for a full 24-hour, 4GByte Twitter stream (on a PC with OS X 10.9.2, 8GByte memory and a 2.7GHz Intel CPU), therefore coming closer to real-time processing requirements. Strong filtering of tweets and terms seems to lead to efficient and clean results, overcoming the heavy noise aspect of Twitter content.</p>
<p>For the future, we intend to compare our method to BNgram [Aie13] and to study the use of news articles and topic-focused streams to obtain a topic zoom-in effect (e.g., topic detection on focused streams separately, Ukraine vs Syria, combining the topics in the end).</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
<p>This work was supported by Science Foundation Ireland under grants 07/CE/I1147 and SFI/12/RC/2289.</p>
        <p>[Cmu14] https://github.com/ianozsvald/ark-tweet-nlp.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[Aie13] <string-name><given-names>Luca Maria</given-names> <surname>Aiello</surname></string-name>, Georgios Petkos, Carlos Martin, David Corney, Symeon Papadopoulos, Ryan Skraba, Ayse Goker, Ioannis Kompatsiaris, and Alejandro Jaimes. <article-title>Sensing trending topics in twitter</article-title>. <source>IEEE Transactions on Multimedia</source>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [All02]
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>Topic detection and tracking: event-based information organization</article-title>
          , volume
          <volume>12</volume>
          . Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bir06]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>Nltk: the natural language toolkit</article-title>
          .
          <source>In Proceedings of the COLING/ACL on Interactive presentation sessions</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Bir09]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <article-title>Natural language processing with Python</article-title>
          . O'Reilly Media, Inc.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Ble03]
          <string-name>
            <given-names>David M</given-names>
            <surname>Blei</surname>
          </string-name>
          , Andrew Y Ng, and
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Gim11]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Gimpel</surname>
          </string-name>
          , Nathan Schneider,
          <string-name>
            <given-names>Brendan</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Mills</surname>
          </string-name>
          , Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A Smith.
          <article-title>Part-of-speech tagging for Twitter: Annotation, features, and experiments</article-title>
          .
          <source>In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Guz13]
          <string-name>
            <given-names>Jheser</given-names>
            <surname>Guzman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Poblete</surname>
          </string-name>
          .
          <article-title>Online relevant anomaly detection in the Twitter stream: an efficient bursty keyword detection model</article-title>
          .
          <source>In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>39</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Jor13]
          <string-name>
            <given-names>Marenet</given-names>
            <surname>Jordaan</surname>
          </string-name>
          .
          <article-title>Poke me, I'm a journalist: The impact of Facebook and Twitter on newsroom routines and cultures at two South African weeklies</article-title>
          .
          <source>Ecquid Novi: African Journalism Studies</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Li12]
          <string-name>
            <given-names>Chenliang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianshu</given-names>
            <surname>Weng</surname>
          </string-name>
          , Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and
          <string-name>
            <given-names>Bu-Sung</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>TwiNER: named entity recognition in targeted Twitter stream</article-title>
          .
          <source>In ACM SIGIR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Mar13]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Corney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ayse</given-names>
            <surname>Göker</surname>
          </string-name>
          .
          <article-title>Finding newsworthy topics on Twitter</article-title>
          .
          <source>IEEE Computer Society Special Technical Community on Social Networking E-Letter</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Mat13]
          <string-name>
            <given-names>Tamas</given-names>
            <surname>Matuszka</surname>
          </string-name>
          , Zoltan Vinceller, and
          <string-name>
            <given-names>Sandor</given-names>
            <surname>Laki</surname>
          </string-name>
          .
          <article-title>On a keyword-lifecycle model for realtime event detection in social network data</article-title>
          .
          <source>In IEEE International Conference on Cognitive Infocommunications</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Mue13]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Muellner</surname>
          </string-name>
          .
          <article-title>fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <volume>53</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          , May
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Pap14]
          <string-name>
            <given-names>Symeon</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , David Corney, and Luca Maria Aiello.
          <article-title>SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media</article-title>
          .
          <source>In Proceedings of the SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Pet10]
          <string-name>
            <given-names>Sasa</given-names>
            <surname>Petrovic</surname>
          </string-name>
          , Miles Osborne, and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          .
          <article-title>Streaming first story detection with application to Twitter</article-title>
          .
          <source>In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Sch14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schifferes</surname>
          </string-name>
          , N. Newman,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thurman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Goker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <article-title>Identifying and verifying news through social media: Developing a user-centred tool for professional journalists</article-title>
          .
          <source>Digital Journalism</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[Sta14] http://nlp.stanford.edu/software.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[Tel13] http://www.telegraph.co.uk/technology/twitter/9945505/Twitter-in-numbers.html.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>