<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Maria Aiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Corney</string-name>
          <email>d.p.a.corney@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo Labs Barcelona</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Robert Gordon Univ. Aberdeen</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>CERTH-ITI Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>The SNOW 2014 Data Challenge aimed at creating a public benchmark and evaluation resource for the problem of topic detection in streams of social content. In particular, given a set of tweets spanning a time interval of interest, the Challenge required the extraction of the most significant news topics in short timeslots within the selected interval. Here, we provide details with respect to the Challenge definition, the data collection and evaluation process, and the results achieved by the 11 teams that participated in it, along with a concise retrospective analysis of the main conclusions and arising issues.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Overview</title>
      <p>Consider a scenario of news professionals who use social media to monitor the newsworthy stories that emerge from the crowd. The volume of information is very high and it is often difficult to extract such stories from a live social media stream. The task of the SNOW 2014 Data Challenge has been to automatically mine social streams, in particular Twitter, to provide journalists with a set of the most important topics for a number of timeslots of interest. In mathematical terms, given a set of Twitter messages M spanning the interval (t0, tmax) and a set of K timeslots of interest S = {S_i | S_i = (t_start^i, t_end^i)}, where t_start^1 ≥ t0 and t_end^K ≤ tmax, the Challenge required participants to produce K ranked lists of topics, one per timeslot: for instance, for timeslot S_i, one would produce a ranked list T^i = {T^i_1, T^i_2, ..., T^i_L}, where L is the maximum number of topics allowed per timeslot. Each topic T is associated with a headline h, a set of tags (annotations) A, a set of representative tweets M_T ⊆ M, and optionally a set of links to images P. Table 1 summarizes the Challenge terminology.</p>
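      <p>As a concrete illustration of the terminology above, the timeslot and topic structures can be sketched in Python. This is a minimal, hypothetical encoding; the field names and example values are ours, not part of the official submission format:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    headline: str      # h: a short headline describing the topic
    tags: list         # A: tags / annotations
    tweet_ids: list    # M_T: representative tweets, a subset of M
    image_urls: list = field(default_factory=list)  # P: optional images

@dataclass
class Timeslot:
    t_start: str       # start of the timeslot
    t_end: str         # end of the timeslot
    topics: list = field(default_factory=list)  # ranked list of Topic

L = 10  # maximum number of topics allowed per timeslot

slot = Timeslot("2014-02-25T18:00Z", "2014-02-25T18:15Z")
slot.topics.append(Topic("Mt. Gox bitcoin exchange halts withdrawals",
                         ["bitcoin", "Mt. Gox"], ["tweet-id-1", "tweet-id-2"]))
assert not len(slot.topics) > L  # a submission must respect the limit L
```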
      <p>Table 1: Challenge terminology. M: set of Twitter messages. S: set of timeslots of interest. S_i = (t_start^i, t_end^i) ∈ S: timeslot i. T^i = {T^i_1, T^i_2, ..., T^i_L}: ranked list of at most L topics for timeslot S_i. T = (h, A, M_T, P): topic T, consisting of a headline h, a set of tags A, a set of representative tweets M_T, and a set of images P.</p>
      <p>The Challenge stated that the ranking of topics per timeslot should be based on the newsworthiness of topics. An operational definition of newsworthiness was adopted: for a given timeslot, we sought topics that would turn out to be important enough to be covered in mainstream news sites.</p>
      <p>In terms of organization, the Challenge proceeded as follows: an Open Call was published at the beginning of December 2013¹, 25 participating teams had registered by the end of January, 11 successfully submitted runs at the beginning of March 2014, and 9 of them submitted papers describing their approach, making up the content of the SNOW 2014 Data Challenge proceedings.</p>
      <p>¹ http://www.snow-workshop.org/2014/?page_id=37</p>
    </sec>
    <sec id="sec-2">
      <title>Data and Ground Truth</title>
      <p>A total of three sets of tweets were used: a development set, a rehearsal set and a test set². The development set consisted of 1,106,712 tweets that we had previously collected during the 2012 US Presidential election³. We had previously analysed these tweets and produced a ground truth of mainstream media stories during the period. We compared several topic-detection algorithms using this data as described in [Aie13]. The set of IDs corresponding to these tweets, with a representative sample of the associated ground-truth topics, was shared at the start of the challenge to allow participants to carry out their own internal evaluations as they developed their systems. To assist with the tweet collection, we also made available a simple tweet scraping utility⁴.</p>
      <p>For the second and third sets, we directed participants to collect tweets via the Twitter Streaming API (also making available a simple wrapper utility), filtering the stream by using provided lists of both users and keywords. By collecting tweets with the Streaming API, participants could avoid potential issues with post-hoc collection of tweets, e.g. via scraping. For the user list, we shared a previously generated list of circa 5,000 "newshounds". A newshound is a Twitter account that tends to report on and discuss major news events, and includes journalists, news outlets and agencies, commentators, politicians and other opinion-formers. In this case, the 5,000 selected are UK-focussed newshounds derived initially from accounts on several public Twitter lists, and then enhanced by analysing their followers. Previous work has shown that using these newshounds to filter the Twitter stream produces a range of newsworthy tweets. Note that the Streaming API returns all messages sent by any user on the list, and also all messages that mention them. In this way, we collect messages sent to journalists, e.g. by eye-witnesses or others with information to share.</p>
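      <p>The filtering rule described above can be sketched as a simple predicate, independent of any particular Twitter client library. The account names and tweets below are illustrative placeholders, not entries from the actual lists:</p>

```python
# Keep a tweet if its author is a newshound, if it mentions a newshound,
# or if its text contains one of the tracked keywords.
NEWSHOUNDS = {"newshound_one", "newshound_two"}  # circa 5,000 accounts in reality
KEYWORDS = {"flood", "floods", "flooding"}       # rehearsal-crawl keywords

def keep(tweet):
    text = tweet["text"].lower()
    if tweet["user"].lower() in NEWSHOUNDS:
        return True
    if any("@" + name in text for name in NEWSHOUNDS):
        return True
    return any(word in text for word in KEYWORDS)

stream = [
    {"user": "newshound_one", "text": "Flood warnings issued in Somerset"},
    {"user": "eyewitness_42", "text": "@newshound_two roads flooded here"},
    {"user": "random_user", "text": "Nice weather today"},
]
kept = [t for t in stream if keep(t)]  # the first two tweets pass the filter
```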
      <p>The second set of tweets was designed as a rehearsal crawl, with the aim of ensuring that the participants were able to collect and process realistic volumes of tweets. No ground truth was provided, but participants could carry out their own informal evaluations. In addition to the list of UK-focussed newshounds, we selected keywords based on events around the time of the crawl. During the rehearsal, UK news was dominated by stories of flooding in the West of England, so we used three keywords: flood, floods, and flooding.</p>
      <p>² The development and test sets are publicly available: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755.</p>
      <p>³ The original set, also used in [Aie13], was larger, as several tweets had been removed from Twitter since the collection was first made.</p>
      <p>⁴ Source code publicly available at: https://github.com/socialsensor/twitter-dataset-collector</p>
      <p>The third set of tweets formed the test set for the final evaluation. We used the same set of UK-focussed newshounds. For the keywords, we considered which stories were likely to continue generating widespread interest and comment, making our final choice immediately before the crawl started. On the morning of the main crawl (25/02/2014), a British national and former Guantanamo Bay detainee, Moazzam Begg, had been arrested on terrorism charges related to Syria, and this was likely to be discussed. The uprising in Ukraine was continuing to generate news stories and great interest. A major bitcoin exchange (Mt. Gox) had suffered a major theft, and this story was likely to generate a lot of comments online, given the technology angle of the story. We also considered protests in Venezuela, but an initial search suggested that there was relatively little UK interest in the events, possibly due to the fact that much of the first-hand reporting was in Spanish. We therefore chose four keywords: Syria, terror, Ukraine and bitcoin. The test set collection started on February 25th, 18:00 GMT and lasted for 24 hours.</p>
      <p>This combination of keywords and newshounds was expected to produce a substantial but manageable volume of tweets, covering a wide range of stories from around the world, but of specific interest to UK-focussed journalists. In this way, we could use the UK mainstream media as a basis for deciding the "ground truth" list of target topics. It also meant that we could ignore non-English language messages and topics, avoiding the complicating issue of translation. Although it is used globally, Twitter remains dominated by English-language tweets. The final number of tweets collected was 1,041,062, representing an average of c.720 tweets per minute. Note that the exact number of tweets that each participant obtained from the Twitter Stream depends on local network connectivity and the slightly stochastic nature of the Streaming API. As we experimentally observed by running independent collection jobs, this varied by just 0.2% or 0.3% of the total number of tweets collected. We therefore shared the ID numbers of all tweets collected, allowing participants to download any tweets missing from their local collections. The tweets were sent by 556,295 accounts, contained 648,651 retweets and 135,141 replies, and just 8,811 (0.85%) of them were geotagged.</p>
      <p>We also generated the list of reference topics T_ref (ground truth), consisting of 59 topics that were the basis of mainstream media stories in UK news outlets during the 24-hour period of the crawl. We produced this by first collecting the headlines from both the BBC RSS news feed⁵ and from NewsWhip UK⁶. From these lists, we merged duplicated stories; removed some stories of limited local or regional interest; and removed several commentary, 'op-ed' or speculative pieces. We finally checked that the remaining stories were represented in the collected tweets, and removed any that were not. This resulted in 59 distinct stories spread over 24 hours. In principle, an ideal topic-detection algorithm should be able to analyse the collection of tweets and identify all 59 stories as major news events, along with a number of other events. The aim in creating this ground truth was not to be exhaustive (which is effectively impossible, given the scale of events taking place around the world in 24 hours, and the imprecise nature of what constitutes "news"); rather, the aim was to produce a wide-ranging set of news stories covering politics, sports, international events, conflicts and so on, each of which was significant enough to generate substantial mainstream media and social media coverage.</p>
      <p>For each story T_i^ref ∈ T_ref, we identified the approximate time that the story first appeared; the headline or label; a list of around three to five keywords or named entities defining the story; a list of two to five representative tweets from the collection; and, where appropriate, the URLs of one or more related images, as shared through the collected tweets. Note that we did not expect participants to retrieve these specific tweets or URLs; they were merely indicative of the contents of the target topic. This information for each story was then used by the evaluation team to measure how effectively each participating team had discovered each story.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Protocol</title>
      <p>Participants were asked to produce up to L = 10 topics per 15-minute timeslot for all timeslots of the 24-hour test set interval. Thus, each participant could submit up to 24 × 4 × 10 = 960 topics. The topics produced by participants were submitted in the agreed format (the same as the one used by the reference topics) to a web application. After submission, participants could browse through their topics and upload new versions of their submission until the submission deadline.</p>
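      <p>The submission limits translate into a simple arithmetic check: 24 hours of 15-minute timeslots gives 96 slots, each capped at L = 10 topics, hence at most 960 topics overall. A hypothetical validator (not the Challenge's actual submission checker) might look like this:</p>

```python
SLOTS = 24 * 4   # 96 fifteen-minute timeslots in the 24-hour test interval
L = 10           # maximum topics per timeslot

def count_topics(submission):
    """Check the limits and return the total number of submitted topics.

    `submission` maps a timeslot index to its ranked list of topics.
    """
    assert not len(submission) > SLOTS, "too many timeslots"
    for slot, topics in submission.items():
        assert not len(topics) > L, "slot %d exceeds %d topics" % (slot, L)
    return sum(len(topics) for topics in submission.values())

assert SLOTS * L == 960  # the overall cap stated in the text
demo = {0: ["topic A", "topic B"], 1: ["topic C"]}
total = count_topics(demo)  # 3 topics in this toy submission
```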
      <p>Subsequently, the evaluation was conducted by three independent evaluators, located in different countries and organizations. The web-based submission application also offered topic annotation features (cf. Figure 1) that were used to assist them in the evaluation. The evaluation was done on a set of five timeslots (starting at 18:00, 22:00 and 23:15 on 25/2, and at 1:00 and 1:30 on 26/2), and was blind, i.e. the evaluators did not know which participant produced any topic they evaluated. The resulting topic annotations were saved in a relational database, and aggregate statistics and results were derived with the use of SQL and, in some cases, some further programmatic post-processing of results.</p>
      <p>⁵ http://feeds.bbc.co.uk/news/rss.xml</p>
      <p>⁶ http://www.newswhip.com/U.K.</p>
      <p>As described in the Challenge page, four evaluation criteria were used: a) precision-recall, b) readability, c) coherence/relevance, d) diversity. The first was quantified by means of the F-score (0-1), while the other three were assessed on a five-level Likert scale. In the following, we provide further details with respect to the computation of the above measures. In addition, submissions were evaluated with respect to image relevance, by means of a precision score (0-1), but this was not taken into account for the final ranking, since associating images with topics was optional.</p>
      <sec id="sec-3-1">
        <title>Precision-recall</title>
        <p>Precision and recall were derived with respect to two sets of reference topics. The first, T_ref, comprised the 59 topics manually created by the organizers as described above, while the second, denoted as T_ext, was created in a pooled way based on the submissions of participants during the five selected timeslots. More specifically, the evaluators assessed (using a tick box) all topics submitted during those five timeslots as being newsworthy or not (cf. paragraph 3.1.1). Topics that received at least two votes by evaluators were included in a list. After removing duplicates, a set of |T_ext| = 70 participant-pooled topics was defined. Note that a few of those topics were also included in T_ref.</p>
        <p>In the case of T_ref we computed only recall: for each participant and for each topic of T_ref, the evaluators identified, with the help of a text-search facility offered by the evaluation web application, at least one matching topic in the full set of submitted topics⁷. In the end, for each participant v, we computed a recall score R_ref(v) ∈ [0, 1] by dividing the number of matched topics N_c^ref(v) by 59. Note that each evaluator performed the matching described above for a part (approximately one-third) of the 59 topics.</p>
        <p>In the case of T_ext, evaluators manually matched the topics of each participant during the five selected timeslots to the topics of T_ext. After the matching, we could easily compute for each participant v the number of correctly matched topics N_c^ext(v) and the number of unique correctly matched topics N_c^ext*(v) (since a participant might detect the same topic in multiple timeslots). Then, for each participant we could compute precision and recall as follows: P_ext(v) = N_c^ext(v) / N(v) and R_ext(v) = N_c^ext*(v) / |T_ext|, where N(v) is the total number of topics submitted by v during the five selected timeslots. On the basis of precision and recall, the F-score was computed as usual: F_ext(v) = 2 · P_ext(v) · R_ext(v) / (P_ext(v) + R_ext(v)).</p>
        <p>⁷ This was the only case where the full set of topics submitted by participants was used.</p>
        <p>Evaluators assessed each of the submitted topics (belonging to the five selected timeslots) as being newsworthy or not, based on the positive and negative examples of Table 2.</p>
        <p>For readability, evaluators were instructed to assign a score between 1 and 5 (half points were also possible) according to the guidelines of Table 3. For each participant v, the readability score Q(v) was computed only on the basis of the newsworthy topics, and by averaging over the three evaluators.</p>
        <p>A similar process was followed for computing coherence C(v), this time using the guidelines of Table 4. The main criterion for assessing coherence is the relevance of the representative tweets to the topic headline. In addition, apart from the headline, evaluators were also instructed to consider tags: in case some of them were found to be irrelevant to the topic headline, they should decrease the coherence score (according to the number of irrelevant tags). Finally, evaluators were instructed to ignore near-duplicate tweets (i.e. neither penalize nor increase the topic coherence).</p>
        <p>To compute diversity D(v), evaluators were instructed to look into the number of different tweets associated with a topic: to consider a tweet as different from another, the tweet should convey some additional information. Moreover, compared to a topic that does not contain any duplication in its tweets, a topic with duplication should be slightly penalized. Depending on the degree of duplication, one may subtract 0.5 to 1 points from the score that they would otherwise assign. Table 5 provides further guidelines on assigning diversity scores.</p>
        <p>To assess the relevance of an image, the evaluators needed to tick a special box in case they found the image(s) relevant to the topic under test. In cases of multiple pictures, the evaluators should make their decision based on the general impression. In the end, a single image relevance score I(v) was computed for each participant by computing the percentage of relevant images in the set of newsworthy topics and averaging over the three evaluators.</p>
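        <p>Under this reading of the pooled evaluation (precision over all correct matches, recall over unique correct matches against the 70 pooled topics), the per-participant F-score can be sketched as follows; the counts in the example are invented for illustration:</p>

```python
T_EXT_SIZE = 70  # number of participant-pooled reference topics in T_ext

def f_score(n_correct, n_correct_unique, n_submitted):
    """F-score from pooled matches, as described for the T_ext evaluation."""
    precision = n_correct / n_submitted      # P_ext(v)
    recall = n_correct_unique / T_EXT_SIZE   # R_ext(v)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 30 matched topics (25 of them unique) out of 50 submitted topics
score = f_score(30, 25, 50)
```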
      </sec>
      <sec id="sec-3-2">
        <title>Normalization and aggregation</title>
        <p>For each of the scores used for the ranking, R_ref, F_ext, Q, C and D, we first identified the maximum attained scores R_ref^max, F_ext^max, Q^max, C^max and D^max, and then normalized the scores of each participant with respect to the latter. For instance, the normalized readability score would be:</p>
        <p>Q*(v) = Q(v) / Q^max (3)</p>
        <p>In the end, the aggregate score for each participant was derived by the following equation:</p>
        <p>AS(v) = 0.25 · R*_ref(v) · F*_ext(v) + 0.25 · Q*(v) + 0.25 · C*(v) + 0.25 · D*(v) (4)</p>
        <p>This was the score used to derive the final ranking for the Challenge.</p>
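        <p>Equations 3 and 4 can be sketched directly in code: each score is divided by the maximum attained value, and the aggregate combines the normalized scores with equal weights of 0.25, the recall and F-score terms entering as a product. The per-team scores below are invented for illustration:</p>

```python
def normalize(scores):
    """Divide every participant's score by the maximum attained score (Eq. 3)."""
    top = max(scores.values())
    return {team: s / top for team, s in scores.items()}

def aggregate(r_ref, f_ext, q, c, d):
    """Aggregate score AS(v) of Eq. 4 from the five raw score dictionaries."""
    r, f, qn, cn, dn = map(normalize, (r_ref, f_ext, q, c, d))
    return {team: 0.25 * r[team] * f[team] + 0.25 * qn[team]
                  + 0.25 * cn[team] + 0.25 * dn[team]
            for team in r}

r_ref = {"team A": 1.0, "team B": 0.9}
f_ext = {"team A": 1.0, "team B": 0.64}
q = {"team A": 4.5, "team B": 4.7}
c = {"team A": 4.8, "team B": 4.4}
d = {"team A": 2.5, "team B": 2.2}

scores = aggregate(r_ref, f_ext, q, c, d)
ranking = sorted(scores, key=scores.get, reverse=True)  # best first
```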
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Table 6 presents the raw scores achieved for each of the selected evaluation measures by the 11 participating teams. It is noteworthy that for each evaluation measure there is a different best method. For instance, the method by Insight [Ifr14] is clearly best in terms of recall (in both T_ref and T_ext) and coherence, the method by PILOTS [Nut14] is best in terms of readability, while the method by SNOWBITS [Bha14] is best in terms of diversity.</p>
      <p>A second noteworthy conclusion is that almost all methods produce high-quality topic headlines (Q &gt; 4) and mostly coherent topics (C &gt; 4). However, the majority of methods suffer from decreased diversity (D &lt; 3). This can be explained by the fact that the majority of topics produced by methods are associated with very few tweets (typically between one and three), resulting in very tight topics. Typically, while some tweets about the same topic share many of the same terms, other tweets will use distinct words. Methods that form topics based on textual similarity of messages may therefore tend to produce these tight topics with very low diversity. Finally, in terms of image relevance, several methods managed to achieve satisfactory performance, with more than half of the proposed images being considered relevant to the topic headline (I &gt; 50%).</p>
      <p>Table 9 presents the normalized scores for each criterion, the aggregate score and the final ranking for all participants. The three winning teams are Insight [Ifr14], RGU [Mar14] and math-dyn [Bur14]. One may conclude that the most distinguishing evaluation measures are topic recall and F-score, with standard deviations of 0.292 and 0.29 respectively across participants, while the least discriminative measures are coherence and readability, with standard deviations of 0.084 and 0.09 respectively.</p>
      <p>Another interesting conclusion can be drawn by looking into the absolute number of unique topics that each method discovered within the five selected timeslots (Table 7). The method with the highest recall [Ifr14] managed to discover 25 of the 70 topics of T_ext. Given that those 70 topics are the result of topic pooling over the results of all methods, we may conclude that there is much room for improving topic recall by combining the results of multiple methods (ensemble topic discovery).</p>
      <sec id="sec-4-1">
        <title>Robustness of results</title>
        <p>To make sure that the evaluation results are robust, we looked into the following: a) inter-annotator agreement (quantified by computing the pairwise correlations of the evaluation distributions); b) alternative score aggregation methods. With respect to the first, we were pleased to note that there was significant agreement between all three evaluators across most of the evaluation aspects (readability appears to be the most subjective of all), as Table 8 testifies. With respect to the latter, we were positively surprised by the fact that several alternative normalization and aggregation schemes led to very similar rankings. More specifically, the first three methods remained the same for a number of different variations based on two schemes: changing the weights of the aggregation scheme of Equation 4 (instead of setting them all equal to 0.25); and subtracting the average value for each score (instead of just dividing by the maximum value). The stability of results over different normalization and aggregation schemes gives more confidence and credibility to the derived ranking.</p>
        <p>In retrospect, the SNOW 2014 Data Challenge managed to bring together a number of researchers working on the problem of topic detection in noisy text streams. Conducting a fair and thorough evaluation of the competing methods proved to be a highly complicated task, calling for a variety of evaluation criteria and in-depth analysis. The results of this report, along with the descriptions of the referenced methods, offer a number of lessons and valuable resources to researchers working on the field.</p>
        <p>At this point, we should highlight a few limitations of the evaluation approach. A first one concerns the limited number of timeslots (and hence topics) assessed by the evaluators, due to the limited time and resources available for evaluation. In the future, one should consider the use of crowdsourcing platforms in order to increase the breadth of the evaluation. In addition, the evaluation was limited to a specific timeslot size (15 minutes), targeting a nearly real-time scenario. Assessing the performance over larger timeslots (e.g. hour, day) could also be considered valuable for a number of applications; however, one should keep away from extrapolating the conclusions drawn from this Challenge to those settings, as the performance of different methods may be affected in different ways with the increase of timeslot size (some methods might …).</p>
        <p>Yet another limitation of the conducted evaluation pertains to assessing the timeliness of detected topics. When matching the submitted topics against the reference topics T_ref, the evaluators completely ignored the temporal information. In that way, a method that discovered a topic early on would be considered equally good as one that discovered the same topic many hours later. Obviously, this is an important performance aspect, especially in the context of breaking news detection, which should be taken into account in future evaluation efforts.</p>
        <p>Last but not least, we should acknowledge that the type of topics sought is another important aspect for evaluating competing methods. In this Challenge, we opted for mainstream news, and that was reflected in the way we constructed T_ref. However, by pooling results from participants (the second topic set T_ext), we also took into account more long-tail topics that were discovered by some of the methods. Alternative evaluation efforts may decide to give more focus to the latter, since one could argue that discovering topics that are mainstream is of limited value (except if those are discovered prior to their appearance in major news sources). Conversely, some of these long-tail topics could be popular Twitter memes or jokes that may be of limited interest to professional journalists, unless they become enormously popular.</p>
        <p>In conclusion, the problem of topic detection is an important and attractive research topic, and the continuous increase of news-oriented social content is expected to make it even more challenging in the future. The Challenge made clear that properly assessing the performance of different methods constitutes a significant challenge on its own, and that more such efforts will be necessary in the future. For such efforts to be fruitful, the increased participation of numerous researchers working on the field is invaluable, and therefore special thanks go to all Data Challenge participants for their hard work and patience throughout the Challenge.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Acknowledgements</title>
        <p>This work has been supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.</p>
        <p>Table 6 (excerpt), per team: R_ref, F_ext. UKON [Pop14]: 0.667, 0.615. IBCN [Can14]: 0.879, 0.592. ITI [Pet14]: 0.485, 0.661. math-dyn [Bur14]: 0.955, 0.640. Insight [Ifr14]: 1.000, 1.000. FUB-TORV [Ama14]: 0.591, 0.119. PILOTS [Nut14]: 0.364, 0.227. RGU [Mar14]: 0.909, 0.686. UoGMIR: 0.258, 0.775. EURECOM: 0.364, 0.062. SNOWBITS [Bha14]: 0.212, 0.408. Std. deviation: 0.292, 0.290.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Aie13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Aiello</surname>
          </string-name>
          , G. Petkos,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Skraba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          .
          <article-title>Sensing trending topics in Twitter</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>15</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1268</fpage>
          -
          <lpage>1282</lpage>
          , Oct
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Pop14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Popovici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grossniklaus</surname>
          </string-name>
          .
          <article-title>Online Clustering for Real-Time Topic Detection in Social Media Streaming Data</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Can14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Van Canneyt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Develder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Detecting Newsworthy Topics in Twitter</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Pet14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <article-title>Two-level message clustering for topic detection in Twitter</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Bur14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Burnside</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Milioris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jacquet</surname>
          </string-name>
          .
          <article-title>One Day in Twitter: Topic Detection Via Joint Complexity</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Ifr14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ifrim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Brigadir</surname>
          </string-name>
          .
          <article-title>Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Ama14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Angelini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          , G. Gambosi,
          <string-name>
            <given-names>Gianluca</given-names>
            <surname>Rossi</surname>
          </string-name>
          .
          <article-title>Time-based Microblog Distillation</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Nut14]
          <string-name>
            <given-names>G.C.</given-names>
            <surname>Nutakki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Abdollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Badami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Distributed LDA based Topic Modeling and Topic Agglomeration in a Latent Space</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Mar14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Martin-Dancausa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goker</surname>
          </string-name>
          .
          <article-title>Real-time topic detection with bursty n-grams</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Bha14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.K.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <article-title>TwiBiNG: A Bipartite News Generator Using Twitter</article-title>
          .
          <source>Proceedings of SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>