SNOW 2014 Data Challenge: Assessing the Performance
  of News Topic Detection Methods in Social Media

          Symeon Papadopoulos                        David Corney                         Luca Maria Aiello
              CERTH-ITI                          Robert Gordon Univ.                         Yahoo Labs
           Thessaloniki, Greece                     Aberdeen, UK                           Barcelona, Spain
             papadop@iti.gr                     d.p.a.corney@rgu.ac.uk                  alucca@yahoo-inc.com



                       Abstract

    The SNOW 2014 Data Challenge aimed at creating a public benchmark and evaluation resource for the problem of topic detection in streams of social content. In particular, given a set of tweets spanning a time interval of interest, the Challenge required the extraction of the most significant news topics in short timeslots within the selected interval. Here, we provide details with respect to the Challenge definition, the data collection and evaluation process, and the results achieved by the 11 teams that participated in it, along with a concise retrospective analysis of the main conclusions and arising issues.

    Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
    In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1    Overview

Consider a scenario of news professionals who use social media to monitor the newsworthy stories that emerge from the crowd. The volume of information is very high, and it is often difficult to extract such stories from a live social media stream. The task of the SNOW 2014 Data Challenge was to automatically mine social streams, in particular Twitter, in order to provide journalists with a set of the most important topics for a number of timeslots of interest. In mathematical terms, given a set of Twitter messages ℳ spanning the interval (t_0, t_max) and a set of K timeslots of interest S = {S^i | S^i = (t^i_start, t^i_end)}, where t^0_start ≥ t_0 and t^K_end ≤ t_max, the Challenge required participants to produce K ranked lists of topics, one per timeslot: for instance, for timeslot S^i, one would produce a ranked list T^i = {T^i_1, T^i_2, ..., T^i_L}, where L is the maximum number of topics allowed per timeslot. Each topic T is associated with a headline h, a set of tags (annotations) A, a set of representative tweets M ⊂ ℳ, and optionally a set of links to images P. Table 1 summarizes the Challenge terminology.

                           Table 1: Challenge terminology

  Symbol                               Explanation
  ℳ                                    Set of Twitter messages
  S                                    Set of timeslots of interest
  S^i = (t^i_start, t^i_end) ∈ S       Timeslot i
  T^i = {T^i_1, T^i_2, ..., T^i_L}     Ranked list of at most L topics for timeslot S^i
  T = (h, A, M, P)                     Topic T consists of a headline h, a set of tags A, a set of
                                       representative tweets M, and a set of images P

   The Challenge stated that the ranking of topics per timeslot should be based on the newsworthiness of topics. An operational definition of newsworthiness was adopted: for a given timeslot, we sought topics that would turn out to be important enough to be covered in mainstream news sites.

   In terms of organization, the Challenge proceeded as follows: an Open Call was published at the beginning of December 2013¹, 25 participating teams registered by the end of January, 11 successfully submitted runs at the beginning of March 2014, and 9 of them submitted papers describing their approach, making up the content of the SNOW 2014 Data Challenge proceedings.

    ¹ http://www.snow-workshop.org/2014/?page_id=37
2    Data and Ground Truth

A total of three sets of tweets were used: a development set, a rehearsal set and a test set². The development set consisted of 1,106,712 tweets that we had previously collected during the 2012 US Presidential election³. We had previously analysed these tweets and produced a ground truth of mainstream media stories during the period. We compared several topic-detection algorithms using this data as described in [Aie13]. The set of IDs corresponding to these tweets, with a representative sample of the associated ground-truth topics, was shared at the start of the challenge to allow participants to carry out their own internal evaluations as they developed their systems. To assist with the tweet collection, we also made available a simple tweet scraping utility⁴.

    ² The development and test sets are publicly available: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
    ³ The original set, also used in [Aie13], was larger, as several tweets had been removed from Twitter since the collection was first made.
    ⁴ Source code publicly available at: https://github.com/socialsensor/twitter-dataset-collector

   For the second and third sets, we directed participants to collect tweets via the Twitter Streaming API (also making available a simple wrapper utility), filtering the stream by using provided lists of both users and keywords. By collecting tweets with the Streaming API, participants could avoid potential issues with post-hoc collection of tweets, e.g. via scraping. For the user list, we shared a previously-generated list of circa 5000 "newshounds". A newshound is a Twitter account that tends to report on and discuss major news events, and includes journalists, news outlets and agencies, commentators, politicians and other opinion-formers. In this case, the 5000 selected are UK-focussed newshounds derived initially from accounts on several public Twitter lists, and then enhanced by analysing their followers. Previous work has shown that using these newshounds to filter the Twitter stream produces a range of newsworthy tweets. Note that the Streaming API returns all messages sent by any user on the list, and also all messages that mention them. In this way, we collect messages sent to journalists, e.g. by eye-witnesses or others with information to share.

   The second set of tweets was designed as a rehearsal crawl, with the aim of ensuring that the participants were able to collect and process realistic volumes of tweets. No ground truth was provided, but participants could carry out their own informal evaluations. In addition to the list of UK-focussed newshounds, we selected keywords based on events around the time of the crawl. During the rehearsal, UK news was dominated by stories of flooding in the West of England, so we used three keywords: flood, floods, and flooding.

   The third set of tweets formed the test set for the final evaluation. We used the same set of UK-focussed newshounds. For the keywords, we considered which stories were likely to continue generating widespread interest and comment, making our final choice immediately before the crawl started. On the morning of the main crawl (25/02/2014), a British national and former Guantanamo Bay detainee, Moazzam Begg, had been arrested on terrorism charges related to Syria, and this was likely to be discussed. The uprising in Ukraine was continuing to generate news stories and great interest. A major bitcoin exchange (Mt. Gox) had suffered a major theft, and this story was likely to generate a lot of comments online, given the technology angle of the story. We also considered protests in Venezuela, but an initial search suggested that there was relatively little UK interest in the events, possibly due to the fact that much of the first-hand reporting was in Spanish. We therefore chose four keywords: Syria, terror, Ukraine and bitcoin. The test set collection started on February 25th, 18:00 GMT and lasted for 24 hours.

   This combination of keywords and newshounds was expected to produce a substantial but manageable volume of tweets, covering a wide range of stories from around the world, but of specific interest to UK-focussed journalists. In this way, we could use the UK mainstream media as a basis for deciding the "ground truth" list of target topics. It also meant that we could ignore non-English language messages and topics, avoiding the complicating issue of translation. Although it is used globally, Twitter remains dominated by English-language tweets. The final number of tweets collected was 1,041,062, representing an average of c.720 tweets per minute. Note that the exact number of tweets that each participant obtained from the Twitter Stream depends on local network connectivity and the slightly stochastic nature of the Streaming API. As we experimentally observed by running independent collection jobs, this varied by just 0.2% or 0.3% of the total number of tweets collected. We therefore shared the ID numbers of all tweets collected, allowing participants to download any tweets missing from their local collections. The tweets were sent by 556,295 accounts, contained 648,651 retweets and 135,141 replies, and just 8,811 (0.85%) of them were geotagged.
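   For illustration, the sketch below shows how a filtered collection of this kind could be set up with the (now legacy) tweepy 3.x streaming interface. The library choice, credentials and file names are assumptions made for the example only; participants in practice used the wrapper utility provided by the organizers.

    import json
    import tweepy  # assumption: the legacy tweepy 3.x streaming interface

    class CollectListener(tweepy.StreamListener):
        """Append every received tweet to a local JSON-lines file."""
        def __init__(self, path="snow_test_set.jsonl"):
            super().__init__()
            self.out = open(path, "a", encoding="utf-8")

        def on_status(self, status):
            self.out.write(json.dumps(status._json) + "\n")

        def on_error(self, status_code):
            return status_code != 420  # disconnect on repeated rate-limit errors

    # Placeholder credentials and input files (hypothetical names).
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    newshound_ids = open("newshound_ids.txt").read().split()  # circa 5000 user IDs
    keywords = ["Syria", "terror", "Ukraine", "bitcoin"]

    stream = tweepy.Stream(auth, CollectListener())
    # follow= returns tweets by (and mentioning) the listed users; track= adds keyword matches.
    stream.filter(follow=newshound_ids, track=keywords)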
   We also generated the list of reference topics T^ref (ground truth), consisting of 59 topics that were the basis of mainstream media stories in UK news outlets during the 24-hour period of the crawl. We produced this by first collecting the headlines from both the BBC RSS news feed⁵ and NewsWhip UK⁶. From these lists, we merged duplicated stories; removed some stories of limited local or regional interest; and removed several commentary, 'op-ed' or speculative pieces. We finally checked that the remaining stories were represented in the collected tweets, and removed any that were not. This resulted in 59 distinct stories spread over 24 hours. In principle, an ideal topic-detection algorithm should be able to analyse the collection of tweets and identify all 59 stories as major news events, along with a number of other events. The aim in creating this ground truth was not to be exhaustive (which is effectively impossible, given the scale of events taking place around the world in 24 hours, and the imprecise nature of what constitutes "news"); rather, the aim was to produce a wide-ranging set of news stories covering politics, sports, international events, conflicts and so on, each of which was significant enough to generate substantial mainstream media and social media coverage.

    ⁵ http://feeds.bbc.co.uk/news/rss.xml
    ⁶ http://www.newswhip.com/U.K.

   For each story T_i^ref ∈ T^ref, we identified the approximate time that the story first appeared; the headline or label; a list of around three to five keywords or named entities defining the story; a list of two to five representative tweets from the collection; and, where appropriate, the URLs of one or more related images, as shared through the collected tweets. Note that we did not expect participants to retrieve these specific tweets or URLs; they were merely indicative of the contents of the target topic. This information for each story was then used by the evaluation team to measure how effectively each participating team had discovered each story.

3    Evaluation Protocol

Participants were asked to produce up to L = 10 topics per 15-minute timeslot for all timeslots of the 24-hour test set interval. Thus, each participant could submit up to 24 × 4 × 10 = 960 topics. The topics produced by participants were submitted in the agreed format (the same as the one used by the reference topics) to a web application. After submission, participants could browse through their topics and upload new versions of their submission until the submission deadline.

   Subsequently, the evaluation was conducted by three independent evaluators, located in different countries and organizations. The web-based submission application also offered topic annotation features (cf. Figure 1) that were used to assist them in the evaluation. The evaluation was done on a set of five timeslots (starting at 18:00, 22:00 and 23:15 on 25/2, and at 1:00 and 1:30 on 26/2), and was blind, i.e. the evaluators did not know which participant produced any topic they evaluated. The resulting topic annotations were saved in a relational database, and aggregate statistics and results were derived with the use of SQL and, in some cases, further programmatic post-processing.

   As described in the Challenge page, four evaluation criteria were used: a) precision-recall, b) readability, c) coherence/relevance, d) diversity. The first would be quantified by means of the F-score (0-1), while the other three would be assessed on a five-level Likert scale. In the following, we provide further details with respect to the computation of the above measures. In addition, submissions were evaluated with respect to image relevance, by means of a precision score (0-1), but this was not taken into account for the final ranking, since associating images with topics was optional.

3.1    Precision-recall

Precision and recall were derived with respect to two sets of reference topics. The first, T^ref, comprised the 59 topics manually created by the organizers as described above, while the second, denoted as T^ext, was created in a pooled way based on the submissions of participants during the five selected timeslots. More specifically, the evaluators assessed (using a tick box) all submitted topics during those five timeslots as being newsworthy or not (cf. paragraph 3.1.1). Topics that received at least two votes by evaluators were included in a list. After removing duplicates, a set of |T^ext| = 70 participant-pooled topics was defined. Note that a few of those topics were also included in T^ref.

   In the case of T^ref we computed only recall: for each participant and for each topic of T^ref, the evaluators identified, with the help of a text-search facility offered by the evaluation web application, at least one matching topic in the full set of submitted topics⁷. In the end, for each participant v, we computed a recall score R_ref(v) ∈ [0, 1] by dividing the number of matched topics N_c^ref(v) by 59. Note that each evaluator performed the matching described above for a part (approximately one third) of the 59 topics.

    ⁷ This was the only case where the full set of topics submitted by participants was used.

   In the case of T^ext, evaluators manually matched the topics of each participant during the five selected timeslots to the topics of T^ext. After the matching, we could easily compute for each participant v the number of correctly matched topics N_c^ext(v) and the number of unique correctly matched topics N_c^ext*(v) (since a participant might detect the same topic in multiple timeslots). Then, for each participant, we could compute precision and recall as follows:

$$P_{ext}(v) = \frac{N_c^{ext}(v)}{N(v)}, \qquad R_{ext}(v) = \frac{N_c^{ext*}(v)}{70} \qquad (1)$$

where N(v) is the total number of topics submitted by v during the five selected timeslots. On the basis of precision and recall, the F-score was computed as usual:

$$F_{ext}(v) = \frac{2 \cdot P_{ext}(v) \cdot R_{ext}(v)}{P_{ext}(v) + R_{ext}(v)} \qquad (2)$$
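   The computations of Equations 1 and 2, together with the T^ref recall, amount to a few lines of code. The sketch below mirrors the notation used above; the counts would come from the manual matching step, and the submitted-topic count used in the example is a hypothetical value.

    def challenge_scores(n_matched_ref, n_matched_ext, n_matched_ext_unique, n_submitted,
                         n_ref_topics=59, n_ext_topics=70):
        """Per-participant recall, precision and F-score (Equations 1 and 2)."""
        r_ref = n_matched_ref / n_ref_topics          # recall against the 59 reference topics
        p_ext = n_matched_ext / n_submitted           # precision over topics submitted in the 5 timeslots
        r_ext = n_matched_ext_unique / n_ext_topics   # recall against the 70 pooled topics
        f_ext = 2 * p_ext * r_ext / (p_ext + r_ext) if (p_ext + r_ext) > 0 else 0.0
        return r_ref, p_ext, r_ext, f_ext

    # Counts for Insight [Ifr14] from Table 7; n_submitted = 50 is a hypothetical value chosen
    # so that the output reproduces the corresponding row of Table 6 (the true N(v) is not reported).
    print(challenge_scores(n_matched_ref=39, n_matched_ext=28, n_matched_ext_unique=25, n_submitted=50))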
3.1.1    Newsworthy assessment

Evaluators assessed each of the submitted topics (belonging to the five selected timeslots) as being newsworthy or not, based on the positive and negative examples of Table 2.

          Table 2: Newsworthy positive and negative examples

  Type    Description
   +      Major news story and/or included in T^ref
   +      Photo-driven news story
   +      Local news story
   +      Announcement of future (scheduled) event
   +      Goal scored in a football match
   -      Opinion article
   -      Analysis article
   -      Speculation
   -      Jokes, gossip
   -      Fake news (e.g. from theonion.com)

3.2    Readability

Evaluators were instructed to assign a score between 1 and 5 (half points were also possible) according to the guidelines of Table 3. For each participant v, the readability score Q(v) was computed only on the basis of the newsworthy topics, and by averaging over the three evaluators.

          Table 3: Readability scoring guidelines

  Score    Description
    5      Understandable, readable and grammatically correct
    4      Understandable but may contain minor grammatical errors
    3      Includes keywords that convey the story but contains major grammatical errors
    2      Hard to read and understand
    1      Completely incomprehensible or nonsense

3.3    Coherence

A similar process was followed for computing coherence C(v), this time using the guidelines of Table 4. The main criterion for assessing coherence is the relevance of the representative tweets to the topic headline. In addition, apart from the headline, evaluators were also instructed to consider tags: in case some of them were found to be irrelevant to the topic headline, they should decrease the coherence score (according to the number of irrelevant tags). Finally, evaluators were instructed to ignore near-duplicate tweets (i.e. neither penalize nor increase the topic coherence).

          Table 4: Coherence scoring guidelines

  Score    Description
    5      All tweets and tags are relevant
    4      More relevant than non-relevant
    3      About the same relevant and non-relevant
    2      Less relevant than non-relevant
    1      None of the tweets or tags are relevant

3.4    Diversity

To compute diversity D(v), evaluators were instructed to look into the number of different tweets associated with a topic: to consider a tweet as different from another, the tweet should convey some additional information. Moreover, compared to a topic that does not contain any duplication in its tweets, a topic with duplication should be slightly penalized. Depending on the degree of duplication, one may subtract 0.5 to 1 points from the score that they would otherwise assign. Table 5 provides further guidelines on assigning diversity scores.

          Table 5: Diversity scoring guidelines

  Score    Description
    5      Several (> 3) different relevant tweets
    4      A few different relevant tweets
    3      At least two different relevant tweets
    2      One relevant tweet
    1      All tweets are irrelevant

3.5    Image relevance

To assess the relevance of an image, the evaluators needed to tick a special box in case they found the image(s) relevant to the topic under test. In cases of multiple pictures, the evaluators should make their decision based on the general impression.
In the end, a single image relevance score I(v) was computed for each participant by computing the percentage of relevant images in the set of newsworthy topics and averaging over the three evaluators.

[Figure 1: Snapshot from the topic annotation interface, featuring a newsworthy and a non-newsworthy topic. Captions depict the scores assigned by one of the evaluators: (a) Q=5, C=5, D=2; (b) Q=3, C=5, D=2.]

3.6    Normalization and aggregation

For each of the scores used for the ranking, R_ref, F_ext, Q, C and D, we first identified the maximum attained scores R_ref^max, F_ext^max, Q^max, C^max and D^max, and then normalized the scores of each participant with respect to the latter. For instance, the normalized readability score would be:

$$Q^*(v) = \frac{Q(v)}{Q^{max}} \qquad (3)$$

In the end, the aggregate score for each participant was derived by the following equation:

$$AS(v) = 0.25 \cdot R^*_{ref}(v) \cdot F^*_{ext}(v) + 0.25 \cdot Q^*(v) + 0.25 \cdot C^*(v) + 0.25 \cdot D^*(v) \qquad (4)$$

This was the score used to derive the final ranking for the Challenge.
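   The normalization of Equation 3 and the aggregation of Equation 4 can be summarized as follows; the dictionary-based layout of the per-team scores is an assumption made for the example.

    def normalize(scores):
        """Divide each raw score by the maximum attained score (Equation 3)."""
        best = max(scores.values())
        return {team: s / best for team, s in scores.items()}

    def aggregate(r_ref, f_ext, q, c, d, weights=(0.25, 0.25, 0.25, 0.25)):
        """Aggregate score AS(v) of Equation 4, computed from the normalized (starred) scores."""
        w1, w2, w3, w4 = weights
        return {team: w1 * r_ref[team] * f_ext[team] + w2 * q[team] + w3 * c[team] + w4 * d[team]
                for team in r_ref}

    # Illustrating the mechanics on two of the raw readability scores of Table 6
    # (the actual normalization is performed over all 11 teams).
    print(normalize({"PILOTS": 4.93, "UKON": 4.29}))   # {'PILOTS': 1.0, 'UKON': 0.870...}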
4    Results

Table 6 presents the raw scores achieved for each of the selected evaluation measures by the 11 participating teams. It is noteworthy that for each evaluation measure there is a different best method. For instance, the method by Insight [Ifr14] is clearly best in terms of recall (on both T^ref and T^ext) and coherence, the method by PILOTS [Nut14] is best in terms of readability, while the method by SNOWBITS [Bha14] is best in terms of diversity.

                              Table 6: Overview of raw scores

  Team                  R_ref   P_ext    R_ext   F_ext    Q      C       D       I
  UKON [Pop14]          0.44    0.481    0.186   0.268   4.29   4.40    2.12   0.542
  IBCN [Can14]          0.58    0.522    0.171   0.258   4.92   4.08    2.36   0.318
  ITI [Pet14]           0.32    0.440    0.214   0.288   4.49   4.68    2.31   0.581
  math-dyn [Bur14]      0.63    0.462    0.200   0.279   4.59   4.91    2.11   0.520
  Insight [Ifr14]       0.66    0.560    0.357   0.436   4.74   4.97    2.11   0.274
  FUB-TORV [Ama14]      0.39    0.267    0.029   0.052   4.18   4.78    2.00     -
  PILOTS [Nut14]        0.24    0.400    0.057   0.099   4.93   4.83    1.92     -
  RGU [Mar14]           0.60    0.388    0.243   0.299   4.71   4.22    3.27   0.588
  UoGMIR                0.17    0.800    0.214   0.338   4.80   3.95    2.36     -
  EURECOM               0.24    0.125    0.014   0.027   3.38   3.75    2.50     -
  SNOWBITS [Bha14]      0.14    0.800    0.100   0.178   4.32   4.36    3.47   0.186

   A second noteworthy conclusion is that almost all methods produce high-quality topic headlines (Q > 4) and mostly coherent topics (C > 4). However, the majority of methods suffer from decreased diversity (D < 3). This can be explained by the fact that the majority of topics produced by the methods are associated with very few tweets (typically between one and three), resulting in very tight topics. Typically, while some tweets about the same topic share many of the same terms, other tweets will use distinct words. Methods that form topics based on the textual similarity of messages may therefore tend to produce these tight topics with very low diversity. Finally, in terms of image relevance, several methods managed to achieve satisfactory performance, with more than half of the proposed images being considered relevant to the topic headline (I > 50%).

   Table 9 presents the normalized scores for each criterion, the aggregate score and the final ranking for all participants. The three winning teams are Insight [Ifr14], RGU [Mar14] and math-dyn [Bur14]. One may conclude that the most discriminative evaluation measures are topic recall and F-score, with standard deviations of 0.292 and 0.29 respectively across participants, while the least discriminative measures are coherence and readability, with standard deviations of 0.084 and 0.09 respectively.

                  Table 9: Overview of normalized scores and aggregate results

  Team                  R*_ref   F*_ext    Q*      C*      D*      I*      AS     Rank
  UKON [Pop14]          0.667    0.615    0.870   0.885   0.611   0.921   0.694    7
  IBCN [Can14]          0.879    0.592    0.998   0.821   0.680   0.540   0.755    4
  ITI [Pet14]           0.485    0.661    0.911   0.942   0.666   0.988   0.710    5
  math-dyn [Bur14]      0.955    0.640    0.931   0.988   0.608   0.885   0.785    3
  Insight [Ifr14]       1.000    1.000    0.961   1.000   0.608   0.466   0.892    1
  FUB-TORV [Ama14]      0.591    0.119    0.848   0.962   0.576     -     0.614   10
  PILOTS [Nut14]        0.364    0.227    1.000   0.972   0.553     -     0.652    9
  RGU [Mar14]           0.909    0.686    0.955   0.849   0.942   1.000   0.842    2
  UoGMIR                0.258    0.775    0.974   0.795   0.680     -     0.662    8
  EURECOM               0.364    0.062    0.686   0.755   0.720     -     0.546   11
  SNOWBITS [Bha14]      0.212    0.408    0.876   0.877   1.000   0.314   0.710    6
  std. deviation        0.292    0.290    0.090   0.084   0.146   0.282   0.100

   Another interesting conclusion can be drawn by looking into the absolute number of unique topics that each method discovered within the five selected timeslots (Table 7). The method with the highest recall [Ifr14] managed to discover 25 of the 70 topics of T^ext. Given that those 70 topics are the result of topic pooling over the results of all methods, we may conclude that there is much room for improving topic recall by combining the results of multiple methods (ensemble topic discovery).

                Table 7: Absolute number of discovered topics

  Team                  N_c^ref    N_c^ext    N_c^ext*
  UKON [Pop14]             26         13         13
  IBCN [Can14]             34         12         12
  ITI [Pet14]              19         22         15
  math-dyn [Bur14]         37         18         14
  Insight [Ifr14]          39         28         25
  FUB-TORV [Ama14]         23          4          2
  PILOTS [Nut14]           14          4          4
  RGU [Mar14]              33         19         17
  UoGMIR                   10         36         15
  EURECOM                  14          1          1
  SNOWBITS [Bha14]          8          8          7

4.1    Robustness of results

To make sure that the evaluation results are robust, we looked into the following: a) inter-annotator agreement (quantified by computing the pairwise correlations of the evaluation distributions), and b) alternative score aggregation methods. With respect to the first, we were pleased to note that there was significant agreement between all three evaluators across most of the evaluation aspects (readability appears to be the most subjective of all), as Table 8 testifies. With respect to the latter, we were positively surprised by the fact that several alternative normalization and aggregation schemes led to very similar rankings. More specifically, the first three methods remained the same for a number of different variations based on two schemes:

  • changing the weights of the aggregation scheme of Equation 4 (instead of setting them all equal to 0.25);

  • subtracting the average value for each score (instead of just dividing by the maximum value).

The stability of results over different normalization and aggregation schemes lends more confidence and credibility to the derived ranking.

                Table 8: Inter-annotator agreement

            Eval. 1-2    Eval. 1-3    Eval. 2-3
  R_ref      0.8949       0.9302       0.8120
  P*_ext     0.8956       0.8823       0.8587
  Q          0.9021       0.3577       0.2786
  C          0.5495       0.7307       0.6844
  D          0.8734       0.8904       0.9059
  I          0.9449       0.9195       0.7960
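   As an illustration of how the pairwise agreement figures of Table 8 could be produced, the sketch below computes Pearson correlations between evaluators' score vectors; the choice of Pearson correlation and the example scores are assumptions, since the report does not state the exact correlation variant used.

    import numpy as np

    def pairwise_agreement(scores_by_evaluator):
        """Pearson correlation between each pair of evaluators' score vectors.

        scores_by_evaluator: one equal-length list per evaluator, holding one score
        per evaluated topic (e.g. the readability scores of all assessed topics).
        """
        arr = np.asarray(scores_by_evaluator, dtype=float)
        agreement = {}
        for i in range(len(arr)):
            for j in range(i + 1, len(arr)):
                r = np.corrcoef(arr[i], arr[j])[0, 1]
                agreement[(i + 1, j + 1)] = round(float(r), 4)
        return agreement

    # Hypothetical readability scores from three evaluators over the same five topics.
    print(pairwise_agreement([[5, 4, 4.5, 3, 5], [5, 4, 4, 3.5, 5], [4, 4, 5, 3, 4.5]]))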
5    Outlook

In retrospect, the SNOW 2014 Data Challenge managed to bring together a number of researchers working on the problem of topic detection in noisy text streams. Conducting a fair and thorough evaluation of the competing methods proved to be a highly complicated task, calling for a variety of evaluation criteria and in-depth analysis. The results of this report, along with the descriptions of the referenced methods, offer a number of lessons and valuable resources to researchers working in the field.

   At this point, we should highlight a few limitations of the evaluation approach. A first one concerns the limited number of timeslots (and hence topics) assessed by the evaluators, due to the limited time and resources available for evaluation. In the future, one should consider the use of crowdsourcing platforms in order to increase the breadth of the evaluation. In addition, the evaluation was limited to a specific timeslot size (15 minutes), targeting a nearly real-time scenario. Assessing the performance over larger timeslots (e.g. hour, day) could also be considered valuable for a number of applications; however, one should refrain from extrapolating the conclusions drawn from this Challenge to those settings, as the performance of different methods may be affected in different ways as the timeslot size increases (some methods might benefit, while others might suffer).

   Yet another limitation of the conducted evaluation pertains to assessing the timeliness of detected topics. When matching the submitted topics against the reference topics T^ref, the evaluators completely ignored the temporal information. In that way, a method that discovered a topic early on would be considered equally good as one that discovered the same topic many hours later. Obviously, this is an important performance aspect, especially in the context of breaking news detection, which should be taken into account in future evaluation efforts.

   Last but not least, we should acknowledge that the type of topics sought is another important aspect for evaluating competing methods. In this Challenge, we opted for mainstream news, and that was reflected in the way we constructed T^ref. However, by pooling results from participants (the second topic set T^ext), we also took into account more long-tail topics that were discovered by some of the methods. Alternative evaluation efforts may decide to give more focus to the latter, since one could argue that discovering topics that are mainstream is of limited value (except if those are discovered prior to their appearance in major news sources). Conversely, some of these long-tail topics could be popular Twitter memes or jokes that may be of limited interest to professional journalists, unless they become enormously popular.

   In conclusion, the problem of topic detection is an important and attractive research topic, and the continuous increase of news-oriented social content is expected to make it even more challenging in the future. The Challenge made clear that properly assessing the performance of different methods constitutes a significant challenge on its own, and that more such efforts will be necessary in the future. For such efforts to be fruitful, the increased participation of the numerous researchers working in the field is invaluable, and therefore special thanks go to all Data Challenge participants for their hard work and patience throughout the Challenge.

Acknowledgements

This work has been supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

References

[Aie13]  L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 15(6):1268–1282, Oct 2013.

[Pop14]  R. Popovici, A. Weiler, M. Grossniklaus. Online Clustering for Real-Time Topic Detection in Social Media Streaming Data. Proceedings of SNOW 2014 Data Challenge, 2014.

[Can14]  S. Van Canneyt, M. Feys, S. Schockaert, T. Demeester, C. Develder, B. Dhoedt. Detecting Newsworthy Topics in Twitter. Proceedings of SNOW 2014 Data Challenge, 2014.

[Pet14]  G. Petkos, S. Papadopoulos, Y. Kompatsiaris. Two-level message clustering for topic detection in Twitter. Proceedings of SNOW 2014 Data Challenge, 2014.

[Bur14]  G. Burnside, D. Milioris, P. Jacquet. One Day in Twitter: Topic Detection Via Joint Complexity. Proceedings of SNOW 2014 Data Challenge, 2014.

[Ifr14]  G. Ifrim, B. Shi, I. Brigadir. Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering. Proceedings of SNOW 2014 Data Challenge, 2014.
[Ama14]  G. Amati, S. Angelini, M. Bianchi, G. Gambosi, G. Rossi. Time-based Microblog Distillation. Proceedings of SNOW 2014 Data Challenge, 2014.

[Nut14]  G.C. Nutakki, O. Nasraoui, B. Abdollahi, M. Badami, W. Sun. Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space. Proceedings of SNOW 2014 Data Challenge, 2014.

[Mar14]  C. Martin-Dancausa, A. Goker. Real-time topic detection with bursty n-grams. Proceedings of SNOW 2014 Data Challenge, 2014.

[Bha14]  D. Bhatia, V.K. Choudhary, Y. Sharma. TwiBiNG: A Bipartite News Generator Using Twitter. Proceedings of SNOW 2014 Data Challenge, 2014.