<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time-based Microblog Distillation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giambattista Amati</string-name>
          <email>gba@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Angelini</string-name>
          <email>sangelini@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Bianchi</string-name>
          <email>mbianchi@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Gambosi</string-name>
          <email>giorgio.gambosi@uniroma2.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Rossi</string-name>
          <email>gianluca.rossi@uniroma2.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Ugo Bordoni</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. of Rome Tor Vergata</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>This paper presents a simple approach for identifying relevant and reliable news from the Twitter stream as soon as it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni and suitably modified to detect the most representative tweets in a specified time slot. This work is a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Microblogging is one of the most successful and widely
used paradigms for communicating and interacting on online
social networks. Under this paradigm, users
can post short messages that are publicly delivered to
all their followers in real time. In particular, Twitter,
the most popular microblogging platform, allows users to
exchange messages (tweets) of at most 140 characters.
This constraint is particularly suitable for posting from
mobile devices, as confirmed by statistics on user
access [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Twitter is used as a vehicle for the prompt,
epidemic diffusion of news, in terms of both
announcements and comments on topics of general interest [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
although it is also widely used for conversation, chatting,
or exchanging updates about user activities or
location, that is, to exchange information valuable at a
personal level. With its claimed 500 million tweets per
day and more than 200 million active users per month
(source: initial public offering of shares of common
stock of Twitter, Inc.), Twitter has become a primary
source of timely information. Being able to discover
relevant news within the overall tweet stream, as soon as it is announced,
is therefore an important issue both for journalists and for ordinary news
readers.
      </p>
      <p>Copyright © by the paper's authors. Copying permitted only
for private and academic purposes.</p>
      <p>This poses several non-trivial problems: identifying
emerging topics as collections of related tweets;
distinguishing news announcements from other types of
information as soon as possible; determining their freshness
in order to gather emerging news as quickly as possible;
diversifying accounts of the latest news to avoid reporting
the same information several times; and evaluating the
reliability of a news announcement, also in terms of
source trustworthiness.</p>
      <p>This paper reports the results of an experiment
aimed at developing a system able to effectively identify
and report relevant and reliable news from the
Twitter stream as soon as it emerges. The approach is
based on a near-real-time system for sentiment
analysis on Twitter, implemented by the Fondazione Ugo
Bordoni and suitably modified to detect the
most representative tweets in a specified time slot.</p>
      <p>This work is a first step towards the
implementation of a prototype supporting journalists in
discovering and finding news on Twitter. To measure
the effectiveness of our algorithms we took part in the
SNOW 2014 Data Challenge: the task defined by the
organizers of this challenge is well suited to our research
purpose. It is worth noting that, although the results of this
experiment seem encouraging, we consider
them just a baseline for future experiments. In
fact, the effectiveness of our strategy can be improved
both by better tuning of the system parameters and
by applying more advanced techniques, such as
timeline analysis to deal with the freshness of tweets; sentiment
analysis to detect neutrality, as expected in news
announcements; and more sophisticated approaches to tweet
clustering and near-duplicate detection.</p>
      <p>The paper is organized as follows. In Section 2 we
briefly introduce the SNOW 2014 Data Challenge task
and the related benchmark. In Section 3 we provide
an architectural overview of the system implemented
by the Fondazione Ugo Bordoni for near-real-time
sentiment analysis on Twitter. In Section 4 we describe
our approach, and in Section 5 we present the results
of a preliminary evaluation of our baseline. Section 6
concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task definition</title>
      <p>The SNOW 2014 Data Challenge defines a task for
real-time topic detection on Twitter. More precisely,
the task consists in identifying the most relevant topics
in time slots of 15 minutes in the period between
25-02-14 (18:00 GMT) and 26-02-14 (18:00 GMT).</p>
      <p>The test data used in the SNOW 2014 Data
Challenge consists of about one million tweets<sup>1</sup>
from the Twitter stream. The filtering was carried out
with the Twitter Streaming
API. Tweets were selected by monitoring four
keywords (i.e. Syria, terror, Ukraine, and bitcoin)
and about 5000 user accounts. Since the monitoring
spanned 24 hours, the total number of analyzed
time slots was 96. For each time slot and each
discovered topic, a short headline should be produced,
together with a set of representative tweets, possibly
URLs of pictures, and finally a set of keywords. The
expected output format is the following:</p>
      <p>time-slot headline keywords tweetIds pictureUrl</p>
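      <p>As an illustration of the slotting described above, the following is a minimal sketch of how a tweet timestamp could be assigned to one of the 96 fifteen-minute slots. It is not the code used in the challenge: the names SLOT, START, END, slot_index and slot_label are hypothetical, and the label format (day-month-year hour:minute) is an assumption.</p>
      <preformat><![CDATA[
```python
from datetime import datetime, timedelta, timezone

# Monitoring window and slot width as stated in the task definition.
SLOT = timedelta(minutes=15)
START = datetime(2014, 2, 25, 18, 0, tzinfo=timezone.utc)
END = datetime(2014, 2, 26, 18, 0, tzinfo=timezone.utc)

# Number of slots in the window (timedelta // timedelta yields an int).
NUM_SLOTS = (END - START) // SLOT


def slot_index(created_at):
    """Return the 0-based slot index of a timestamp, or None if it is outside the window."""
    if not (START <= created_at < END):
        return None
    return (created_at - START) // SLOT


def slot_label(index):
    """Return the start time of a slot as a label (format is an assumption)."""
    return (START + index * SLOT).strftime("%d-%m-%Y %H:%M")
```
]]></preformat>
      <p>With a 24-hour window and 15-minute slots, NUM_SLOTS evaluates to 96, matching the number of time slots reported above.</p>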
      <p>With respect to the SNOW 2014 Challenge task, we
fulfilled the task by providing the following output:
instead of a headline summarizing the discovered topic,
we return the most representative tweet for that topic,
and we report its tweetId as the representative tweet in
the tweetIds field.</p>
      <p><sup>1</sup> While the SNOW 2014 Data Challenge organizers collected
1,041,062 tweets, we filtered 1,040,362 tweets. In any case, the
difference, in the order of 0.067%, is not statistically significant.</p>
    </sec>
    <sec id="sec-3">
      <title>System description</title>
      <p>
        The experiments were conducted using a
system for near-real-time sentiment analysis on
Twitter. This system, developed by the Fondazione Ugo
Bordoni, is based on the Terrier framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Figure 1 presents a high-level architectural overview of
the system.
      </p>
      <p>
        The Twitter stream is filtered by Twitter
Connectors, software components that use the free
Twitter Streaming API. As specified by the Twitter
Streaming API specification, each connector can
define a filter composed of at most 400 keywords and
5000 user accounts. Since the API is used free of charge,
the service provided by Twitter works in a
best-effort fashion: as a consequence, if a filter is too
noisy (i.e. the number of tweets matching the monitored
keywords is too high), Twitter does not guarantee the
delivery of all tweets matching the conditions defined by
the connector. All tweets collected by the connectors are
stored in a distributed installation of MongoDB [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Since the platform is mainly intended to implement the
sentiment analysis solution described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the system
includes a Web application for the manual annotation
of tweets and a software component (the Sentiment
Analysis Dictionary Builder) for the automatic
generation of dictionaries containing weighted
opinion-bearing terms. The dictionaries are used by an extended
version of Terrier, specifically implemented to support
the indexing of tweets and to enable time-based
mining activities on the indexed collection. The front end
of the system is a Web application
implementing several tools for time-based
searches (e.g. search by relevance, search by
freshness, search for opinions), for discovering latent concepts
related to a specified topic, for producing charts, and so
on. Figure 2 shows the Buzz Chart produced by the
Web application for the SNOW 2014 test
collection.
      </p>
      <p>This system was used to take part in the SNOW
2014 Challenge by simply submitting an "empty" query
for the desired time slots and selecting
relevance sorting. The system automatically retrieves the
relevant tweets and the representative weighted words for
those time slots. In the following section we detail our
approach to time-based topic distillation.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimentation</title>
      <p>We have simulated a time-based distillation of tweets
from the Twitter stream, assuming that the test
collection is not biased by the filtering keywords, although a very
limited number of keywords were used to filter
Twitter's firehose (e.g. Syria, terror, Ukraine, and bitcoin).
In fact, due to this limited number of keywords, the
collection cannot be considered an unbiased sample of</p>
      <p>[Figure 1: High-level architectural overview of the system. Components: Twitter Connectors (tweets containing key sets n. 1 to n. 4), MongoDB, Extended Terrier, Dictionaries, Analytic Web Tools, Web Application for manual annotation, Annotated Tweets, Sentiment Analysis Dictionary Builder.]</p>
      <p>
a) We assumed that we were processing an unbiased
stream. We gathered all tweets into time slots of
15 minutes. Thus, we did not search for tweets by
using the four original topics; we filtered
the results by time only.</p>
      <p>b) We used a very fast English-language filter. A</p>
      <p>d) We used a very light and fast
near-duplicate detection (NDD) algorithm to remove
tweets from the set retrieved in the second pass. In
particular, two tweets are considered near-duplicates
if they share a bigram of two consecutive non-stopword
terms. The near-duplicate tweet lower in
the ranking was eliminated. We finally presented
the first three tweets per time slot.</p>
      <p>Since the submitted run contains just 288 records (i.e.
3 tweets times 96 time slots), we performed a
complete manual evaluation in order to measure the
precision of our baseline. Our assessment focused on the
relevance of the tweets presented as representative of a
news item. Interestingly, we quickly realized that it is not trivial
to determine what should be classified as news. For
example, should the following tweet be considered news?</p>
      <p>115 - Sergio Aguero has the best minutes per goal
rate in @PremierLeague history scoring on average
every 115 minutes. Delivered.</p>
      <p>To reduce the impact
of subjectivity, each tweet was evaluated by
three human experts and classified as:</p>
      <p>highly relevant (i.e. it is news), if all three
human experts agree in considering the tweet
representative of a news item;</p>
      <p>not relevant (i.e. it is not news), if all three
human experts agree in considering the tweet not
representative of a news item;</p>
      <p>relevant (i.e. it seems to be news), otherwise.</p>
      <p>The precision of our system (i.e. P@3) varies
between 0.34, if we consider only the highly relevant class,
and 0.58, if we also consider the relevant one. It is
worth noting that these results are strongly influenced by
the choice to return exactly three tweets for each time
slot. In terms of precision, this strategy can be
disadvantageous when a time slot does not contain any</p>
      <p>if a tweet contains the substring, it is probably
representative of a news item. This hypothesis is
confirmed by the resulting precision, which is equal
to 0.94;</p>
      <p>if a news item emerging from a tweet containing the
substring is not represented by a tweet in the
submitted run, we missed the news item;</p>
      <p>if a time slot does not contain any tweet
containing the substring and all tweets in the submitted
run in that time slot have been evaluated as "not
relevant", then the time slot is not considered in
the computation of the precision and recall values,
because we have no evidence of the
existence of news to be discovered: this hypothesis
simulates the case in which the system is able to
return an empty result when a time slot does not
contain any news. Applying this rule, 9 time slots
were removed.</p>
      <p>Considering both the highly relevant and the relevant classes,
we obtain a precision of 0.64 and a recall of 0.80.</p>
      <p>Even though we know that we performed an incomplete
assessment, we believe this is an encouraging starting point
for the implementation of a vertical system for time-based
topic detection on Twitter. The official
evaluation</p>
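      <p>The near-duplicate rule of step d) can be illustrated with the following minimal sketch. It is not the implementation used for the submitted run: the tokenizer, the tiny stopword list, and the names content_bigrams and distill are assumptions made for illustration. We also assume that a bigram counts only when neither of its two adjacent terms is a stopword, and that each tweet is compared only against the higher-ranked tweets already kept.</p>
      <preformat><![CDATA[
```python
import re

# Tiny illustrative stopword list (an assumption; the paper does not specify one).
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for", "at", "by", "with"}
)


def content_bigrams(text):
    """Return the set of adjacent token pairs in which neither token is a stopword."""
    tokens = re.findall(r"[a-z0-9@#']+", text.lower())
    return {
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    }


def distill(ranked_tweets, top_k=3):
    """Scan the ranking top-down and keep up to top_k tweets, skipping near-duplicates.

    A tweet is skipped when it shares at least one content bigram with a
    tweet that was already kept, so the lower-ranked duplicate is eliminated.
    """
    kept = []
    seen = set()
    for tweet in ranked_tweets:
        bigrams = content_bigrams(tweet)
        if bigrams & seen:
            continue  # near-duplicate of a higher-ranked kept tweet
        kept.append(tweet)
        seen |= bigrams
        if len(kept) == top_k:
            break
    return kept
```
]]></preformat>
      <p>Calling distill on the relevance-ranked tweets of one time slot returns at most three mutually distinct tweets, mirroring the three records per slot of the submitted run.</p>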
      <p>[Table 1. Time slots: 26-02-2014 13:30 (three entries), 26-02-2014 13:45 (two entries). Keyword sets: "last,defoe,jermain,game,dnipro,tottenham"; "ukraine,russia,putin,troops"; "ukraine,troops,putin,russian,news,puts,alert"; "syria,troops,state,175,media,army"; "bitcoin,founder,new,mt,still,gox,karpeles,mark".]</p>
      <p>In this paper we have described our approach to a
challenging task: time-based topic distillation from
microblogs. More precisely, we have reported on the
strategy adopted to submit a preliminary baseline to the
SNOW 2014 Data Challenge, together with a first
assessment attempt. Starting from this baseline, we
will explore the following research directions:</p>
      <p>a) The use of a topic-based clustering method, e.g.
k-means driven by topic, or of a search-based
result set, to further split each time slot into
homogeneous clusters.</p>
      <p>b) The filtering of tweets by sentiment polarity.
Sentiment analysis can indeed be useful to detect
neutral tweets, since we assume that breaking
news does not in general contain opinions or
sentiment polarity, unless the news quotes other
people's statements.</p>
      <p>
        c) Freshness and tweet peak analysis improve
retrieval quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The best representative of
each time-based cluster can be further selected
by taking into account topic relevance, diversity, and
freshness, not just diversity and relevance as
we have done with our baseline. Zipf's law, other
fat-tailed distributions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or an exponential
decay function [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can enhance early precision. At
the moment we do not use any time-based
retrieval function to order or select the tweets
representative of the selected news.
      </p>
      <p>
        d) The NDD algorithm was so restrictive that only
a few tweets were selected among the topmost
relevant retrieved ones. For this reason we
decided to select only a small number of tweets
for each time slot. If we had used a less
aggressive near-duplicate detection method, for
example one based on Jaccard's coefficient instead of a
simple bigram-sharing condition, we would have
been able to produce a longer list of relevant
and diverse news. Diversity thus requires a
refinement of NDD in combination with freshness
and topic relevance. Because of the overly
restrictive NDD condition between tweets, we have not
produced the list of near-duplicate candidates for
each selected tweet. The use of min-wise
independent permutations for NDD [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in Twitter search
can be easily handled with k-grams
with k greater than or equal to three, even without
sophisticated similarity functions such
as Jaccard's. In fact, due to the shortness of
messages (a tweet contains 13 words on average),
there is a high probability that near-duplicates
share only one k-gram within a short time slot.
Obviously, such a tight condition would be too
restrictive for larger collections and, more importantly,
when near-duplicates are not confined to very short
periods of time. We have thus easily singled out
duplicates not only by removing tweets
containing the RT token, but also by removing tweets
sharing any k-gram. In order to be more selective
in the initial ranking, we have further relaxed this
condition to bigrams (which include entities such as
Mark Karpeles, western Russia, etc. in Table 1),
but at the moment we cannot evaluate the
resulting loss in recall.
      </p>
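      <p>To make the contrast in direction d) concrete, the following minimal sketch computes Jaccard's coefficient over word k-grams, the less aggressive alternative to the bigram-sharing condition. It is illustrative only: the function names, the whitespace tokenization, the default k = 3, and the 0.5 threshold are assumptions, not values taken from our experiments.</p>
      <preformat><![CDATA[
```python
def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(first, second, k=3, threshold=0.5):
    """Flag two tweets as near-duplicates when their shingle sets overlap enough.

    The threshold is a tunable assumption; sharing a single k-gram is no
    longer sufficient unless the threshold is set very low.
    """
    return jaccard(shingles(first, k), shingles(second, k)) >= threshold
```
]]></preformat>
      <p>Under a sharing rule, two tweets with a single common k-gram are collapsed; with a Jaccard threshold, the same pair is kept apart unless a sufficient fraction of their k-grams coincide, which would leave more candidates in each time slot.</p>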
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Fondazione Ugo Bordoni carried out this work in
collaboration with Almawave.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          , E. Ambrosi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaibisso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gambosi</surname>
          </string-name>
          .
          <article-title>Automatic construction of an opinion-term vocabulary for ad hoc retrieval</article-title>
          . In C. Macdonald,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ruthven</surname>
          </string-name>
          , and R. W. White, editors,
          <source>ECIR</source>
          , volume
          <volume>4956</volume>
          of Lecture Notes in Computer Science, pages
          <volume>89</volume>
          -
          <fpage>100</fpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          , G. Amodeo, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaibisso</surname>
          </string-name>
          .
          <article-title>Survival analysis for freshness in microblogging search</article-title>
          . In X. wen Chen, G. Lebanon,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and M. J. Zaki, editors,
          <source>CIKM</source>
          , pages
          <volume>2483</volume>
          -
          <fpage>2486</fpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amodeo</surname>
          </string-name>
          , G. Amati, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gambosi</surname>
          </string-name>
          .
          <article-title>On relevance, time and query expansion</article-title>
          .
          <source>In Proceedings of the 20th ACM international conference on Information and knowledge management</source>
          ,
          <source>CIKM '11</source>
          , pages
          <year>1973</year>
          -
          <year>1976</year>
          , New York, NY, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          .
          <article-title>Completeness and robustness properties of min-wise independent permutations</article-title>
          .
          <source>Random Struct. Algorithms</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <volume>18</volume>
          -
          <fpage>30</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chodorow</surname>
          </string-name>
          .
          <source>MongoDB: The Definitive Guide</source>
          . O'Reilly Media,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Java</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Tseng</surname>
          </string-name>
          .
          <article-title>Why we twitter: understanding microblogging usage and communities</article-title>
          .
          <source>WebKDD/SNA-KDD'07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Time-based language models</article-title>
          .
          <source>In Proceedings of the twelfth international conference on Information and knowledge management</source>
          ,
          <source>CIKM '03</source>
          , pages
          <fpage>469</fpage>
          -
          <fpage>475</fpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lunden</surname>
          </string-name>
          .
          <article-title>Mobile twitter: 164m+ (75%) access from handheld devices monthly, 65% of ad sales come from mobile</article-title>
          . http://techcrunch.com/2013/10/03/mobiletwitter-161m-access-from-handheld-devices-eachmonth-65-of-ad-revenues-coming-from-mobile/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Terrier information retrieval platform</article-title>
          . In D. E. Losada and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Fernandez-Luna</surname>
          </string-name>
          , editors,
          <source>ECIR</source>
          , volume
          <volume>3408</volume>
          of Lecture Notes in Computer Science, pages
          <volume>517</volume>
          -
          <fpage>519</fpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Aiello</surname>
          </string-name>
          .
          <article-title>Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media</article-title>
          .
          <source>In Proceedings of the SNOW 2014 Data Challenge</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>