Two-level message clustering for topic detection in Twitter

Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris
CERTH-ITI, Thessaloniki, Greece
gpetkos@iti.gr, papadop@iti.gr, ikom@iti.gr
Abstract

This paper presents our approach to the topic detection challenge organized by the 2014 SNOW workshop. The applied approach utilizes a document-pivot algorithm for topic detection, i.e. it clusters documents and treats each cluster as a topic. We modify a previous version of a common document-pivot algorithm by considering specific features of tweets that are strong indicators that particular sets of tweets belong to the same cluster. Additionally, we recognize that the granularity of topics is an important factor to consider when performing topic detection, and we also take advantage of this when ranking topics.
1 Introduction

This paper presents our approach to the topic detection challenge organized by the 2014 SNOW workshop. Details about the challenge and the motivation behind it can be found in [Pap14]. The task did not only involve topic detection per se; it also required the development of approaches related to the presentation of topics: topic ranking, relevant image retrieval, and title and keyword extraction. We present the solutions we applied to each of these problems. Open source implementations of most of the methods used are already available in a public repository[1] and the rest will be made available soon.

The rest of the paper is structured as follows. In Section 2 we provide a brief overview of existing topic detection methods. Subsequently, Section 3 presents our approach for treating the different aspects of the challenge. Then, in Section 4 we present a preliminary evaluation of the overall approach, together with some of the topics it produced, and finally Section 5 concludes the paper.

[1] https://github.com/socialsensor/topic-detection

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

2 Related work

At a very high level, there are three different classes of topic detection methods:

1. Document-pivot methods: these approaches cluster documents together using some measure of document similarity, e.g. cosine similarity on a bag-of-words representation with a tf-idf weighting scheme. For instance, the approach in [Pet10] falls in this class and uses an incremental, threshold-based cluster assignment procedure. That is, it examines each document in turn, finds its best match among the already examined documents, and either assigns it to the same cluster as its best match or initializes a new cluster, depending on whether the similarity to the best match is above some threshold. Documents are compared using cosine similarity on tf-idf representations, while a Locality Sensitive Hashing (LSH) scheme is utilized in order to rapidly retrieve the best match. A variant of this approach is utilized in this work.

2. Feature-pivot methods: these approaches cluster terms together according to their cooccurrence patterns. For instance, the algorithm presented in [?] performs a sequence of signal processing operations on a tf-idf-like representation of term occurrence through time in order to select the most "bursty" terms. Subsequently, the distribution of appearance of the selected terms through time is modelled using a mixture of Gaussians. Eventually, a cooccurrence measure between terms is computed using the KL-Divergence of the corresponding distributions, and terms are clustered using a greedy procedure based on this measure.
3. Probabilistic topic models: these represent the joint distribution of topics and terms using a generative probabilistic model which has a set of latent variables that represent topics, terms, hyperparameters, etc. Probably the most commonly used probabilistic topic model, and one that has been extended in many ways, is LDA [Ble03]. LDA uses hidden variables that represent the per-topic term distribution and the per-document topic distribution. A concise review of probabilistic topic models can be found in [Ble12].

For a more thorough review of existing topic detection methods please see [Aie13].
Two of the most important problems for topic detection are fragmentation and merging of topics. Fragmentation occurs when the same actual story/topic is represented by many different produced topics. This is quite common in document-pivot methods, such as the one that we build upon (e.g. if the threshold is set too high). Merging is in some sense the opposite of fragmentation, i.e. it occurs when many distinct topics, not related to each other, are represented by a single topic. In the case of document-pivot methods, merging may occur when the threshold is set too low. In that case, it is possible that the occurrence of terms that are not important for a topic results in two documents related to different topics being matched. The merged topics may either be higher-level topics comprising related lower-level topics, or mixed topics of lower-level topics that are not related to each other, depending on the features on which the assignment of tweets to clusters has occurred. The first case may be acceptable depending on the required granularity of topics, but the second case is undesirable, as it will produce topics that are inconsistent and of limited use to the end user. Thus, it is crucial for document-pivot methods both to do the matching based on the important textual features and to select the threshold appropriately. From an end user's perspective, fragmentation is bad because it results in redundant and overly specific topics, whereas merging has a much more negative effect, as it is quite likely to produce incomprehensible topics.

3 Approach

The challenge did not only involve topic detection per se; it also involved various aspects of topic presentation and enrichment: topic ranking, title and keyword extraction, as well as retrieval of relevant tweets and multimedia. In the following we present the pursued approaches for each of these problems.

3.1 Pre-processing

The pre-processing phase of the employed solution involves duplicate item aggregation and language-based filtering. Duplicate item aggregation is carried out because tweets posted on Twitter are often either retweets or copies of previous messages. Thus, it makes sense, for computational efficiency reasons, to process in subsequent steps only a single copy of each duplicate item, while also keeping the number (and the ids) of occurrences of each of them. We implemented this by hashing the text of each tweet and only keeping the text of one tweet per bucket. In practice, we observed a significant computational gain by doing this (the computational cost of the hashing procedure is very small). Indicatively, for the first test timeslot, the instance of our crawler collected 15,090 tweets and after duplicate removal we ended up with roughly half of them: 7,546 tweets in particular. It should also be noted that the hashing scheme we utilized put in the same bucket all exact duplicates but not near-duplicates. For instance, cases where a user copies a message but adds or removes some characters are typically not captured as duplicates. It is possible though to modify the pre-processing so that most such cases are also captured: e.g. one could filter out the "RT" string and the user mentions and repeat the same hashing procedure, or one could detect near-duplicates using Jaccard similarity (using an inverted index for speed). These options were briefly tested, but thorough testing and deployment has been deferred.

The second step involves language detection. We use a public Java implementation[2], which provided almost perfect detection accuracy. As dictated by the challenge guidelines, we only keep content in the English language. This further reduces the number of tweets that need to be processed in further steps. For instance, in the first timeslot, after the removal of non-English tweets we end up with 6,359 tweets (from the 7,546 non-duplicate tweets that were tested).

[2] https://code.google.com/p/language-detection/
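The following is a minimal Python sketch of this pre-processing stage, not the actual implementation: the tweet fields ('id', 'text') are assumptions made for illustration, and the langdetect package merely stands in for the Java library cited above.

    import hashlib
    from collections import defaultdict

    from langdetect import detect  # stand-in for the Java library; pip install langdetect

    def preprocess(tweets):
        """Aggregate exact duplicates, then keep only English tweets.
        Each tweet is assumed to be a dict with 'id' and 'text' keys."""
        # Bucket tweets by a hash of their text; one representative per bucket.
        buckets = defaultdict(list)
        for tweet in tweets:
            key = hashlib.md5(tweet['text'].encode('utf-8')).hexdigest()
            buckets[key].append(tweet)

        representatives = []
        for copies in buckets.values():
            rep = dict(copies[0])
            rep['num_copies'] = len(copies)              # kept for later frequency counts
            rep['copy_ids'] = [t['id'] for t in copies]
            representatives.append(rep)

        # Language filtering: keep English content only.
        english = []
        for tweet in representatives:
            try:
                if detect(tweet['text']) == 'en':
                    english.append(tweet)
            except Exception:
                pass  # detection can fail on e.g. URL-only texts; drop those
        return english

Normalizing away the "RT" marker and @-mentions before hashing would extend the same sketch to the near-duplicate cases discussed above.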
3.2 Topic detection

Having a collection of tweets (with duplicates and non-English tweets removed), we now proceed to detect topics in it. In previous work [Aie13], we experimented with all three classes of methods. All present many challenges when applied to a dataset retrieved from Twitter.
The main reason is that Twitter messages are very short. For document-pivot methods this exacerbates the problem of fragmentation, as it is more likely, at least compared to longer documents, that although a pair of messages discusses the same topic, there may not be enough terms present in both of them to link them. For feature-pivot methods, the problem with short documents is very similar: in short documents it is more likely that the terms that represent a topic will not cooccur frequently enough to be clustered together. In this work, we opt for a document-pivot approach, similar to that of [Pet10], but we modify it in order to take advantage of some features that can significantly improve the document clustering procedure. In particular, we recognize two facts: a) tweets that contain the same URL refer to the same topic, and b) a tweet and a reply to it refer to the same topic. Therefore, we can immediately cluster together tweets that contain the same URL, and we can also cluster tweets together with their replies. Considering that there will be cases where these initial clusters contain tweets that do not share the same textual features, we can expect that taking such information into account should improve the results of a pure document-pivot approach by reducing fragmentation.
Thus, the idea is to perform some first-level grouping of items based on the above features, which will subsequently be taken into account as part of a second-level document-pivot procedure. In order to obtain the first-level grouping, we utilize a Union-Find structure [Cor09]. Essentially, we create a graph that contains one node for each tweet, connect pairs of tweets that contain the same URL or that are related by a reply, and obtain the set of connected components. Components that contain more than one tweet are the first-level groups that we will subsequently use in our second-level clustering procedure. Clearly, a large number of tweets, those that form singleton components, are not put into any first-level cluster. Those tweets are not discarded and are also considered in the second-level clustering algorithm.
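A minimal sketch of this first-level grouping follows; the Union-Find structure with path compression follows [Cor09], while the tweet fields ('id', 'urls', 'reply_to') are hypothetical names used only for illustration.

    class UnionFind:
        """Union-Find (disjoint sets) with path compression [Cor09]."""
        def __init__(self, n):
            self.parent = list(range(n))

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path compression
                x = self.parent[x]
            return x

        def union(self, x, y):
            self.parent[self.find(x)] = self.find(y)

    def first_level_groups(tweets):
        """Group tweets that share a URL or are linked by a reply."""
        uf = UnionFind(len(tweets))
        index = {t['id']: i for i, t in enumerate(tweets)}

        by_url = {}
        for i, t in enumerate(tweets):
            for url in t.get('urls', []):
                if url in by_url:
                    uf.union(i, by_url[url])   # same URL -> same component
                else:
                    by_url[url] = i
            target = t.get('reply_to')
            if target in index:
                uf.union(i, index[target])     # reply -> same component

        components = {}
        for i in range(len(tweets)):
            components.setdefault(uf.find(i), []).append(i)
        # Only components with more than one tweet form first-level groups.
        return [c for c in components.values() if len(c) > 1]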
The algorithm employed for the second-level clustering is similar to that of [Pet10] (i.e. we use an incremental, threshold-based clustering procedure and LSH for fast retrieval), but has some modifications. We take the first-level clustering into account by examining whether each new tweet to be clustered (it is reminded that all tweets are examined, whether they belong to some first-level cluster or not) has been assigned to a first-level cluster; if it has, the other tweets from the first-level cluster are immediately assigned to the same second-level cluster (and are not further examined in subsequent clustering steps). Thus, all the first-level clusters become members of the second-level clusters produced by the document-pivot procedure. Additionally, a second-level cluster may also contain tweets that were not members of a first-level cluster, and second-level clusters may be created from tweets that did not belong to any first-level cluster.
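The sketch below illustrates this second-level procedure under simplifying assumptions: a brute-force scan over the already clustered tweets stands in for the LSH index of [Pet10], and the sparse tf-idf vectors (see Eq. 1 below) are assumed to be precomputed.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse term -> weight dicts."""
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def second_level_clustering(vectors, first_level, threshold=0.9):
        """Incremental, threshold-based clustering (cf. [Pet10]); the linear
        scan over already clustered tweets stands in for the LSH index.
        `vectors` maps tweet index -> sparse tf-idf dict, and `first_level`
        holds the tweet-index groups produced by the Union-Find step."""
        group_of = {}
        for group in first_level:
            for i in group:
                group_of[i] = tuple(group)

        clusters = []      # each cluster is a list of tweet indices
        cluster_of = {}    # tweet index -> cluster id
        for i, vec in vectors.items():
            if i in cluster_of:
                continue   # already pulled in through its first-level group
            best, best_sim = None, 0.0
            for j in cluster_of:
                sim = cosine(vec, vectors[j])
                if sim > best_sim:
                    best, best_sim = j, sim
            if best is not None and best_sim >= threshold:
                cid = cluster_of[best]
            else:
                cid = len(clusters)
                clusters.append([])
            # Assign the tweet and, immediately, all of its first-level peers.
            for k in group_of.get(i, (i,)):
                if k not in cluster_of:
                    cluster_of[k] = cid
                    clusters[cid].append(k)
        return clusters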
In practice, by inspection of the results of early experiments, it turns out that there still is some fragmentation: some topics are represented by multiple second-level clusters. Therefore we sought ways to reduce this fragmentation.

We first experimented with a semantic representation utilizing WordNet. In particular, instead of representing the documents with a plain bag-of-words representation that uses the raw textual features, we tried to use the synsets of the verbs and nouns in each document. Such a representation could improve the results, since it would introduce some semantics in the document matching procedure and could match documents that do not contain the same raw terms. In practice, preliminary results showed that this is indeed true; however, it is also very likely to have the opposite effect, i.e. topic merging. Eventually, we dropped the idea of using WordNet features to represent documents and pursued a more moderate approach in order to deal with fragmentation.

This consisted of two things. First, we utilized lemmatized terms instead of raw terms in order to be able to better match terms. We also considered the use of stemming, but stemming is a much less reliable process and may introduce false matches. Additionally, we recognize that some features are more important than others for text matching. These features include named entities and hashtags. We use a tf-idf representation of documents and we boost the terms that correspond to named entities and hashtags by some constant factor (1.5 in our experiments; later we will also examine the effect of using non-constant boost factors). More formally, for the lemmatized term j in the ith document we compute the tf-idf_i^j weight as follows:

    tf\text{-}idf_i^j = \begin{cases} 1.5 \times tf_i^j \times idf^j, & \text{if } j \text{ is an entity or hashtag} \\ tf_i^j \times idf^j, & \text{otherwise} \end{cases} \qquad (1)

where tf_i^j is the frequency of the term in the document and idf^j is the inverse document frequency of the term in an independent, randomly collected corpus (more details on this corpus will be provided later). For lemmatization and named entity recognition we utilize the Stanford Core NLP library[3].

[3] http://nlp.stanford.edu/software/corenlp.shtml
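A small sketch of this weighting scheme is given below; the idf table and the set of boosted terms are assumed to be available (in our pipeline they would come from the independent corpus and from Stanford Core NLP, respectively).

    from collections import Counter

    def tf_idf_vector(tokens, idf, boosted_terms, boost=1.5):
        """Boosted tf-idf weights of Eq. 1 for one document. `tokens` is
        the list of lemmatized terms of the document, `idf` maps term ->
        inverse document frequency (estimated on the independent corpus),
        and `boosted_terms` is the set of terms recognized as named
        entities or hashtags. Replacing the constant 1.5 with the per-term
        corpus frequency cf_j gives the Eq. 2 variant of Section 3.3."""
        vec = {}
        for term, freq in Counter(tokens).items():
            weight = freq * idf.get(term, 0.0)
            if term in boosted_terms:
                weight *= boost  # entities and hashtags count for more
            vec[term] = weight
        return vec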
Finally, as mentioned before, the threshold value is an important parameter of the process. We opt for a high threshold (0.9) so that there is no merging, at the cost of some fragmentation (despite the modifications that we made to avoid it). As will be shown in a later section, where we present some empirical results, the produced topics are quite clear, meaning that there is no merging, and come at the appropriate level of granularity.
3.3 Ranking

The challenge required that only 10 topics per timeslot are returned. The preliminary tweet grouping step resulted in a few hundred first-level topics (483 for the first timeslot). When we apply the document-pivot clustering procedure we end up with considerably more second-level topics (2,669 for the first timeslot using a threshold of 0.9). Although, as verified by inspection, there still is some fragmentation, the number of actual topics is quite large. Thus, we need to rank the produced second-level topics in order to select the most important ones.
Initially, we considered simply ranking the topics according to the number of documents they include and the number of retweets these documents receive. However, we realized that the granularity and hierarchy of topics is also important for topic ranking. As already discussed, some topics may be considered subtopics of larger topics, and it is reasonable that the attention that a larger topic attracts should affect the ranking of related finer topics. For instance, the most popular high-level topic in our corpus is the events in Ukraine. This was determined in an early exploratory stage of our study by examining, for each term, the ratio of its likelihood of appearance in the test corpus to its likelihood of appearance in an independent, randomly collected corpus (for more details on this likelihood, please see the section on title extraction); the term "Ukraine" had the highest ratio. It makes sense then that although a topic about some event in Ukraine may be linked to as many documents as another topic about, say, a concert, considering the overall attention that the events in Ukraine received, the Ukraine-related topic should be ranked higher.

In order to take advantage of this, we apply the following procedure. We perform a new clustering of the documents, but this time we boost the weight of hashtags and entities further. The boost factor is not the same for each entity and hashtag; instead, it is linear in the term's frequency of appearance in the corpus. More formally, the tf-idf_i^j weights are computed as follows (cf. Eq. 1):

    tf\text{-}idf_i^j = \begin{cases} cf_j \times tf_i^j \times idf^j, & \text{if } j \text{ is an entity or hashtag} \\ tf_i^j \times idf^j, & \text{otherwise} \end{cases} \qquad (2)

where cf_j is the frequency of appearance of the entity or hashtag j in the test corpus. This significantly reduces the number of produced topics (1,345 for the first timeslot, whereas 2,669 topics were produced by the second-level clustering for the same timeslot) and, by inspection, it appears that it reduces fragmentation a lot. Importantly, merging takes place, but only related topics are merged into clean higher-level topics. For example, the algorithm manages to put all documents related to Ukraine in the same cluster. Subsequently, we rank these high-level topics by the number of documents in the corresponding clusters and link each second-level topic produced by the initial document-pivot procedure to the corresponding high-level topic. The linking is carried out by finding which high-level topic contains the largest number of tweets of each second-level topic. Finally, we rank all second-level topics belonging to the same high-level topic according to the number of tweets they contain. Eventually, we have a two-level clustering: one for high-level topics and one for the low/second-level topics within each of them. In order to select the 10 topics to return, we apply a simple heuristic procedure, with the number of low-level topics selected from each high-level topic dropping linearly as its rank drops. More specifically, we apply the following procedure. First, we examine only the top-ranked high-level topic and select a single low-level topic from it. Then, we examine the top two high-level topics and select one low-level cluster from each of them, and so on until we obtain 10 topics. Of course, selected second-level topics are not reconsidered for selection when a high-level topic is revisited during the described procedure. Also, in case there are not enough low-level topics in some high-level topic, we just skip it.
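The following sketch illustrates this widening round-robin selection, assuming that both the high-level topics and the low-level topics within each of them are already ranked:

    def select_topics(high_level, num_topics=10):
        """Pick `num_topics` low-level topics from ranked high-level topics.
        `high_level` is a list, ordered by high-level rank, whose elements
        are lists of low-level topic ids, each ordered by their own rank."""
        remaining = [list(topics) for topics in high_level]
        selected = []
        width = 1
        while len(selected) < num_topics and any(remaining):
            # Pass over the top `width` high-level topics, taking the best
            # not-yet-selected low-level topic from each (empty ones are skipped).
            for topics in remaining[:width]:
                if topics and len(selected) < num_topics:
                    selected.append(topics.pop(0))
            width = min(width + 1, len(remaining))
        return selected

With 10 requested topics, this procedure visits at most the top four high-level topics, since 1 + 2 + 3 + 4 = 10.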
It should also be noted that we attempted to produce high-level topics without additional boosting of entities or hashtags, either by lowering the similarity threshold or by clustering second-level super-documents, but both of these approaches resulted in mixed topics. It appears that these mixed topics were formed based on less important textual features, which are more common across different topics. On the other hand, the applied approach of boosting entities and hashtags in a more aggressive manner did not produce any mixed topics and did indeed manage to surface the higher-level topics.

3.4 Title extraction

We first split the text of each tweet in the cluster into sentences to obtain a set of candidate titles. Clearly, splitting the text into sentences makes sense, as the title has to be a coherent piece of language. To obtain sentence separation we again use the Stanford NLP library.
Having an initial set of candidate titles, we subsequently compute the Levenshtein distance between each pair of candidate titles in order to reduce the number of actual candidates. In the final step, we rank the candidate titles using both their frequency and their textual features. The score of a title is the product of its frequency and the average likelihood of appearance, in an independent corpus, of the terms that it contains. The likelihood of appearance of a term t was obtained using a smoothed estimate in order to account for terms not appearing in the independent corpus:

    p(t) = \frac{c_t + 1}{N + V} \qquad (3)

where c_t is the count of appearances of t in the independent corpus, N is the total number of (non-unique) terms in the corpus and V is the vocabulary size (larger than the number of unique terms in the corpus). The corpus that was utilized to obtain these estimates was collected by randomly sampling from the Twitter streaming API and consisted of 1,954,095 tweets. It should also be noted that removed candidates increase the frequency count of their most similar candidate, and also that, despite the fact that we do not process duplicate items, the count of duplicates removed for each processed item contributes to the frequency of the sentences extracted from it.
3.5 Keyword extraction

The keyword extraction process is similar to the title extraction process. However, instead of complete sentences, we now examine either noun phrases or verb phrases. We decided to work with noun phrases and verb phrases instead of unigram terms because they generally provide a less ambiguous summary of topics. In particular, short phrases can be more meaningful, regardless of the order in which they appear, as compared to single terms. For instance, let us consider one of the topics in the first timeslot of the test set. That topic is about Ukrainian journalists publishing a number of documents found in president Yanukovich's house. The set of keywords we produced was: "secret documents", "Yanukovich 's estate", "Ukraine euromaidan", "was trying", "president 's estate". One can see that, regardless of the sequence of these phrases, one can grasp a fairly good idea of the topic. If, however, we used single terms, it could be possible, depending on the order of terms, that some of them would be incorrectly associated, e.g. "secret" could be associated with the term "estate" instead of the term "documents".

Eventually, in order to select the keywords, we rank them according to their frequency in the clustered documents and their likelihood of appearance in an independent corpus, as we did for the titles. However, for keyword extraction we are not limited to selecting a single candidate, as is the case for title extraction. Thus, we need a mechanism for selecting the number of top-ranked candidate keywords. We utilize a "largest gap" heuristic to do this. That is, after ranking the candidate keywords, we compute the score difference between subsequent candidates, find the position in the ranked list with the largest difference, and select all terms up to that position.

At the final step of the process, we add to the set of keywords the set of most important entities. These are determined using a similar "largest gap" heuristic, and we only add them if they do not already appear as part of a phrase in the set of keywords. Finally, it should be noted that we use the Stanford NLP library to obtain the noun and verb phrases. However, instead of doing a full parsing of the texts, which would be computationally costly, we perform part-of-speech tagging and apply some heuristic rules to obtain noun and verb phrases from the part-of-speech tags. More particularly, we identify sequences of terms consisting only of nouns, adjectives and possessive endings (e.g. "'s") as noun phrases, and we identify sequences of terms consisting only of verbs as verb phrases.
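The two heuristics of this section can be sketched as follows; the candidate scores (frequency times smoothed likelihood, cf. Eq. 3) are assumed to have been computed already, and Penn Treebank part-of-speech tags are an assumption for the chunking rules:

    def largest_gap_cutoff(scored):
        """Given (candidate, score) pairs, keep everything above the
        largest score drop between consecutive ranked candidates."""
        ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
        if len(ranked) < 2:
            return [c for c, _ in ranked]
        gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
        cut = gaps.index(max(gaps)) + 1
        return [c for c, _ in ranked[:cut]]

    def chunk_phrases(tagged):
        """Extract noun/verb phrases from (token, POS-tag) pairs with the
        simple rules described above: maximal runs of nouns, adjectives
        and possessive endings become noun phrases, runs of verbs become
        verb phrases. Penn Treebank tags (NN*, JJ*, POS, VB*) assumed."""
        def kind(tag):
            if tag.startswith(('NN', 'JJ')) or tag == 'POS':
                return 'NP'
            if tag.startswith('VB'):
                return 'VP'
            return None

        phrases, current, current_kind = [], [], None
        for token, tag in tagged + [('', 'END')]:  # sentinel flushes the last run
            k = kind(tag)
            if k != current_kind and current:
                phrases.append(' '.join(current))
                current = []
            current_kind = k
            if k is not None:
                current.append(token)
        return phrases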
3.6 Representative tweets selection

The challenge also requires that a number of representative and, as much as possible, diverse tweets is provided for each topic. The set of related tweets can easily be obtained in our approach, since we utilize a document-pivot method. Regarding diversity, the duplicate removal step that we apply at the first stage of our processing partly takes care of this requirement. However, there are still some near-duplicates that were not captured by the duplicate removal step. Additionally, to introduce as much diversity as possible, we make sure that all replies from the topic's cluster are included in the set of representative tweets, and we additionally include the most frequent tweets (making sure that the total number of selected tweets is at most 10).
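A minimal sketch of this selection, reusing the hypothetical 'reply_to' and 'num_copies' tweet fields of the earlier sketches:

    def representative_tweets(cluster, max_tweets=10):
        """Select representative tweets for a topic: all replies first
        (for diversity), then the remaining tweets ordered by how many
        duplicates were aggregated into them, capped at `max_tweets`."""
        replies = [t for t in cluster if t.get('reply_to')]
        others = [t for t in cluster if not t.get('reply_to')]
        others.sort(key=lambda t: t.get('num_copies', 1), reverse=True)
        return (replies + others)[:max_tweets]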
3.7 Relevant image extraction

We retrieve relevant images by applying a very simple procedure. In particular, if the tweets associated with a topic contain the URL of some image, then we find the most frequent image and return it. Otherwise, we issue a query to the Google search API, searching by the title of the topic, and associate the first image returned with the topic. In a few cases, this did not return any results; then we issue a further query, this time using the most popular keyword. It should be noted though that this approach has a limitation: the Google search API allows only a specific number of queries per day, and thus we had to issue repeated queries over a long period of time in order to obtain results for each image. A potentially better option in that respect would be to use a different search API, such as Twitter's.

[Figure 1: line plot over the test timeslots (y-axis in tens of thousands) with five series: # tweets (original), # tweets (after duplicate aggregation), # tweets (after language filtering), # first-level clusters, # second-level clusters.]

Figure 1: Number of tweets before and after duplicate aggregation and language filtering, as well as the number of first-level and second-level clusters produced.


4 Evaluation

In the following we examine different aspects of the applied approach and then comment on the quality of the produced topics. Figure 1 displays the number of tweets before and after duplicate aggregation and language filtering, as well as the number of first-level and second-level clusters produced. One thing to note is that for all timeslots there is a significant reduction in the number of tweets to be clustered after duplicate aggregation and language filtering. Additionally, for all timeslots there is a number of first-level clusters (typically a few hundred), each of which contains at least two tweets, meaning that we immediately, and without resorting to any complicated clustering operations, obtain initial clusterings for a significant part of the tweets to be clustered. The number of second-level topics is typically larger though, as tweets that did not form first-level clusters also participate in the second-level clustering procedure. It is also interesting to note that the computational cost of the complete procedure for each timeslot is not that high. In particular, the complete set of operations (first- and second-level clustering, ranking, title and keyword extraction, as well as relevant image retrieval) took on average 65.33 seconds per timeslot on a machine with moderate computational resources (an Intel Q9300 CPU running at 2.5 GHz and 4 GB of RAM).

Table 1 presents the ten topics produced by our approach for the first test timeslot. As a first remark, it appears that all topics are related to a distinct event, which is fairly well represented by both the title and the keywords. It should be noted though that in some cases the set of keywords may not be enough by itself to provide a very clear picture of the essence of the story. For instance, in the story about the Ukrainian parliament voting to send Yanukovich to The Hague, the keyword "Hague" is missing, although it should be included. Thus, the keyword extraction process could be improved by appropriately changing the mechanism that automatically selects the number of phrases and entities to return (across the complete test collection, the minimum number of keywords retrieved was 1, the maximum was 5 and the average was 2.625). Also, due to the heuristic that we applied to rapidly retrieve noun and verb phrases, we occasionally obtain mixed noun and verb phrases, e.g. the phrase "Ukrainian parliament votes". The title, on the other hand, makes perfect sense and is, in all displayed topics (and most other topics as well), very indicative of the topic. Finally, the multimedia retrieved are sometimes very relevant and sometimes not so much; e.g., for the topic about the cost of Yanukovich's house, the retrieved image is the front page of some newspaper.

5 Conclusions

In this paper we presented the approach pursued by our team for participating in the SNOW 2014 Data Challenge. In short, we have utilized a document-pivot approach, but we have taken advantage of features that allow us to improve the quality of the detected clusters. In particular, we have taken advantage of commonly appearing URLs and of reply relationships between tweets, formulating a two-level clustering procedure. We have tuned our clustering so that it provides a set of topics at the required granularity, i.e. low-level stories rather than high-level topics, at the cost of some fragmentation. In practice, this provided very good topics. Subsequently, we apply a number of NLP techniques in order to enrich the representation of topics: we use sentence splitting for title extraction and noun and verb phrase extraction for identifying key phrases. Additionally, we identify that the ranking of a topic should be related to the importance of any larger topic that it may be linked to, and we apply an appropriate procedure in order to achieve a two-level ranking of topics.

Acknowledgments

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.
References

[Pap14] S. Papadopoulos, D. Corney, L. Aiello. SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. Proceedings of the SNOW 2014 Data Challenge, 2014.

[Pet10] S. Petrović, M. Osborne, V. Lavrenko. Streaming First Story Detection with Application to Twitter. HLT: Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010.

[Wen10] J. Weng, E. Lim, J. Jiang, Q. He. TwitterRank: Finding Topic-Sensitive Influential Twitterers. Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010.

[Ble03] D. Blei, A. Ng, M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[Ble12] D. Blei. Probabilistic Topic Models. Communications of the ACM, 55(4):77–84, 2012.

[Aie13] L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. Sensing Trending Topics in Twitter. IEEE Transactions on Multimedia, 15(6):1268–1282, Oct 2013.

[Cor09] T. Cormen, C. Leiserson, R. Rivest, C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 2009.
Title: Fight for the right to be free!
Keywords: Ukraine, madonna, free !! fight fascism
Relevant tweet: @Madonna: Fight for the right to be free!! Fight Fascism everywhere! Free Venezuela the Ukraine&Russia #artforfreedom

Title: Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon
Keywords: Legendary russian commander, Ukraine
Relevant tweet: @RT com #Ukraine toppling craze reaches even legendary Russian commander,who fought Napoleon http://on.rt.com/izqunf

Title: Ukraine parliament votes to send Yanukovych to The Hague
Keywords: Ukraine parliament votes, Yanukovych
Relevant tweet: #Ukraine parliament votes to send Yanukovych to The Hague

Title: Ukraine's president spent $2.3m on dining room decor, $17k tablecloths, $1m to water his lawn
Keywords: Ukraine 's president, dining room decor
Relevant tweet: #Ukraine's president spent $2.3M on dining room decor, $17K tablecloths, $1M to water his lawn

Title: Journalists in Ukraine are in the process of uploading 1000s of secret documents found at Yanukovich's estate
Keywords: Secret documents, Yanukovich 's estate, Ukraine euromaidan, was trying, president 's estate
Relevant tweet: The #YanukovychLeaks is up! Here are the documents recovered at the ousted presidents estate. #Ukraine #euromaidan

Title: Mt. Gox takes Bitcoin exchange offline as currency woes mount, does not say when transactions/withdawals will resume
Keywords: Gox, bitcoin exchange offline, currency woes
Relevant tweet: Mt. Gox takes #Bitcoin exchange offline as currency woes mount, http://fxn.ws/1ppoMGk @joerogan

Title: Can't decide if I want to write this week's Most Googled Song about Seth Myers Jimmy Fallon or Bitcoin. Thoughts??
Keywords: Bitcoin, Seth
Relevant tweet: Can't decide if I want to write this week's Most Googled Song about Seth Myers & Jimmy Fallon or Bitcoin... Thoughts??

Title: Syria aid still stalled after UN.
Keywords: Melawanlupa syria aid, resolution, stalled
Relevant tweet: #MelawanLupa RT #Syria #aid still stalled after #UN. resolution http://reut.rs/1mwaqlh

Title: Remarks at today's UN General Assembly briefing on the Humanitarian Situation in Syria
Keywords: Today 's un general assembly briefing
Relevant tweet: Remarks by @AmbassadorPower at today's UN General Assembly briefing on the Humanitarian Situation in #Syria: http://go.usa.gov/Bt2d

Title: Usmnt's friendly vs Ukraine on March 5 moved to Cyprus, according to Ukraine's Football Federation.
Keywords: Usmnt 's friendly, Ukraine's football federation
Relevant tweet: #USMNT's friendly vs Ukraine on March 5 moved to Cyprus, according to Ukraine's Football Federation. http://foxs.pt/NuDSvq

Table 1: The 10 topics produced by our approach for the first timeslot.