=Paper=
{{Paper
|id=Vol-1150/amati
|storemode=property
|title=Time-based Microblog Distillation
|pdfUrl=https://ceur-ws.org/Vol-1150/amati.pdf
|volume=Vol-1150
|dblpUrl=https://dblp.org/rec/conf/www/AmatiA0GR14
}}
==Time-based Microblog Distillation==
<pdf width="1500px">https://ceur-ws.org/Vol-1150/amati.pdf</pdf>
<pre>
                         Time-based Microblog Distillation

        Giambattista Amati                       Simone Angelini                       Marco Bianchi
      Fondazione Ugo Bordoni                  Fondazione Ugo Bordoni               Fondazione Ugo Bordoni
           Rome, Italy                              Rome, Italy                         Rome, Italy
            gba@fub.it                            sangelini@fub.it                    mbianchi@fub.it

                          Giorgio Gambosi                               Gianluca Rossi
                     Univ. of Rome Tor Vergata                    Univ. of Rome Tor Vergata
                             Rome, Italy                                 Rome, Italy
                    giorgio.gambosi@uniroma2.it                   gianluca.rossi@uniroma2.it


                                        Abstract

    This paper presents a simple approach for identifying relevant and reliable news in the
    Twitter stream as soon as it emerges. The approach is based on a near-real-time system for
    sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni and suitably modified
    to detect the most representative tweets in a specified time slot. This work represents a
    first step towards the implementation of a prototype supporting journalists in discovering
    and finding news on Twitter.

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge,
Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

1    Introduction
Microblogging is one of the most successful and widely used paradigms for communicating and
interacting on online social networks. Under this paradigm, users post short messages that are
publicly delivered to all their followers in real time. In particular Twitter, the most popular
microblogging framework, allows users to exchange messages (tweets) of at most 140 characters.
This constraint is particularly suitable for posting from mobile devices, as confirmed by
statistics on user access [8].
   Twitter is used as a vehicle for the prompt, epidemic diffusion of news, in terms of both
announcements and comments on topics of general interest [6], though it is also largely used
for conversation, chatting, or exchanging updates about user activities or location, that is,
to exchange information valuable at a personal level. With its claimed 500 million tweets per
day and more than 200 million active users per month (source: Initial public offering of shares
of common stock of Twitter, Inc.), Twitter turns out to be a primary source of timely
information. Being able to discover relevant news within the overall tweet stream as soon as it
is announced is an important issue both for journalists and for ordinary news readers.
   This poses several non-trivial problems: identifying emerging topics as collections of
related tweets; distinguishing news announcements from other types of information as soon as
possible; determining their freshness, to gather emerging news as quickly as possible;
diversifying accounts of the latest news, to avoid reporting the same information several
times; and evaluating the reliability of a news announcement, also in terms of source
trustworthiness.
   This paper reports the results of an experiment aimed at developing a system able to
effectively identify and report relevant and reliable news from the Twitter stream as soon as
it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter,
implemented by the Fondazione Ugo Bordoni and suitably modified to detect the most
representative tweets in a specified time slot.
   This work represents a first step towards the implementation of a prototype supporting
journalists in discovering and finding news on Twitter. To measure the effectiveness of our
algorithms we took part in the SNOW 2014 Data Challenge: the task defined by the organizers of
this challenge is well suited to our research purpose. It is worth noting that, even if the
results of this experiment seem encouraging, we consider them just a baseline for future
experiments. In fact, the effectiveness of our strategy can be improved both by better tuning
of the system parameters and by applying more advanced techniques, such as timeline analysis to
deal with the freshness of tweets; sentiment analysis to detect neutrality, as expected in news
announcements; and more sophisticated approaches for tweet clustering and near-duplicate
detection.
   The paper is organized as follows: in Section 2 we briefly introduce the SNOW 2014 Data
Challenge task and the related benchmark. In Section 3 we provide an architectural overview of
the system implemented by the Fondazione Ugo Bordoni for near-real-time sentiment analysis on
Twitter. In Section 4 we describe our approach, and in Section 5 we present the results of a
preliminary evaluation of our baseline. Section 6 concludes the paper.

2    Task definition
The SNOW 2014 Data Challenge defines a task for real-time topic detection on Twitter. More
precisely, the task consists in identifying the most relevant topics in time slots of 15
minutes in the period between 25-02-14 (18:00 GMT) and 26-02-14 (18:00 GMT).
   The test data used in the SNOW 2014 Data Challenge is composed of about one million tweets
(note 1) from the Twitter Stream. The filtering activity was conducted using the Twitter
Streaming API. Tweets were selected by monitoring four keywords (i.e. Syria, terror, Ukraine,
and bitcoin) and about 5000 user accounts. Since the monitoring spanned 24 hours, the total
number of analyzed time slots was 96. For each time slot and each discovered topic, a short
headline should be produced, together with a set of representative tweets, possibly URLs of
pictures, and finally a set of keywords. The expected output format is the following:

time-slot headline keywords tweetIds pictureUrl

   We fulfilled the SNOW 2014 Challenge task by providing the following outcomes: instead of a
headline summarizing the discovered topic, we return the most representative tweet for that
topic, and we report its tweetId as the representative tweet in the tweetIds field.

   1 While the SNOW 2014 Data Challenge organizers collected 1,041,062 tweets, we filtered
1,040,362 tweets. The difference, in the order of 0.067%, is not statistically significant.

3    System description
The experiments were conducted using a system for near-real-time sentiment analysis on Twitter.
This system, developed by the Fondazione Ugo Bordoni, is based on the Terrier framework [9].
Figure 1 presents a high-level architectural overview of the system.

   [Figure 1 diagram; components shown: Twitter Connectors; Tweets containing keys set n. 1 to
   n. 4; MongoDB; Web Application for manual annotation; Annotated Tweets; Sentiment Analysis
   Dictionary Builder; Dictionaries; Extended Terrier; Analytic Web Tools]

   Figure 1: Architectural overview of our system for near-real-time sentiment analysis on Twitter

   The Twitter Stream is filtered by Twitter Connectors, software components that use the free
Twitter Streaming API. As specified by the Twitter Streaming API Specification, each connector
can define a filter composed of at most 400 keywords and 5000 user accounts. Since the API is
used for free, the service provided by Twitter works in a best-effort fashion: as a
consequence, if a filter is too noisy (i.e. the number of tweets matching the monitored
keywords is too high), Twitter does not guarantee the delivery of all tweets matching the
conditions defined by the connector. All tweets collected by the connectors are stored in a
distributed installation of MongoDB [5]. Since the platform is mainly oriented towards
implementing the sentiment analysis solution described in [1], the system includes a Web
application for the manual annotation of tweets and a software component (i.e. the Sentiment
Analysis Dictionary Builder) for the automatic generation of Dictionaries containing weighted
opinion-bearing terms. Dictionaries are used by an extended version of Terrier, specifically
implemented to support the indexing of tweets and to enable time-based mining activities on the
indexed collection. The front end of the system is a Web application implementing several tools
to perform time-based searches (e.g. search by relevance, search by freshness, search for
opinions), to discover latent concepts related to a specified topic, to produce charts, and so
on. Figure 2 shows the Buzz Chart produced by the Web application for the SNOW 2014 test
collection.
   This system was used to take part in the SNOW 2014 Challenge by simply submitting an “empty”
query for the desired time slots and sorting by relevance. The system automatically retrieves
relevant tweets and representative weighted words for those time slots. In the following
section we detail our approach to time-based topic distillation.

4    Experimentation
We have simulated a time-based distillation of tweets from the Twitter stream, assuming that
the test collection is unbiased by the filtering keywords, although a very limited number of
keywords were used to filter Twitter’s firehose (i.e. Syria, terror, Ukraine, and bitcoin). In
fact, due to this limited number of keywords, the collection cannot be considered an unbiased
sample of
Twitter’s firehose (about 30% of the tweets in the collection contain the above-mentioned
terms). As a consequence, any methodology exploiting the knowledge of these keywords could
easily have been affected by overfitting. For a similar reason, and for the sake of generality,
we have neither used the list of seed authors to filter the news nor adopted ad hoc solutions
for this type of task, such as the use of specific clue keywords to detect breaking news (e.g.
the word BREAKING at the beginning of a tweet) or the ids of the seed authors. However, we have
deliberately removed retweets from the retrieved set because this was an explicit requirement
of the task.
   Since we did not have criteria or evaluation measures to assess the quality of the result
set, the parameter values were either the defaults provided by our system or set by empirical
evidence. The training set was thus only used to evaluate qualitatively the distilled result
set of tweets, and it was not used to tune the parameters of the filtering system.
   We have submitted a run that will be used as a benchmark to evaluate our future experiments,
on the basis of the evaluation measures that will be provided by the organizers [10]. For this
reason we consider the submitted run just a baseline for future experiments.
   To submit our baseline we have tackled the following issues:

  a) We have assumed that we process an unbiased stream. We have gathered all tweets into time
     slots of 15 minutes. Thus, we have not searched for tweets using the four original topics,
     but we have filtered the results by time only.

  b) We have used a very fast English-based filter. A set of common English terms was submitted
     to the system as an unweighted query and searched against the inverted index in order to
     produce a first-pass retrieval. This lexicon was used to eliminate non-English tweets from
     the stream. We have used a stopword list of 453 English words as the query to filter
     tweets written in English, thus reducing the collection to 94.10% of its original size.
     The error rate (non-English tweets after retrieval) was 9.03% in the sample submitted to
     the SNOW competition. We do not yet have statistics on the false-negative error rate for
     the non-retrieved set.

  c) Though we have not used a query for the first-pass retrieval, we have ranked the tweets of
     each time slot by relevance using a query expansion technique. We have applied the
     Bose-Einstein query expansion weights to determine the new query terms. The Bose-Einstein
     (BO) weight is a variant of the Kullback-Leibler divergence (KL) and is preferred to KL
     when recall is more important than early precision, as required in our case by the absence
     of a topic-based first-pass retrieval [3].

  d) We have used a very light and fast near-duplicate detection (NDD) algorithm to remove
     tweets from the second-pass retrieved set. In particular, two tweets are considered
     near-duplicates if they share a bigram of two consecutive non-stopword terms. The
     near-duplicate tweet lower in the ranking was eliminated. We finally presented the first
     three tweets per time slot.
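
Steps a) and d) can be illustrated with a minimal Python sketch. It is not the authors' Terrier-based implementation: the function names, the tiny stopword set, the tokenizer and the sample tweets are hypothetical, and the relevance scores are assumed to come from the ranking of step c).

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta, timezone

SLOT = timedelta(minutes=15)

# Tiny illustrative stopword set (assumption); the paper uses a list of 453 English words.
STOPWORDS = {"a", "an", "and", "as", "the", "of", "in", "on", "to", "is", "has", "for", "by"}


def slot_start(ts, origin):
    """Start of the 15-minute slot that contains ts, counting slots from origin."""
    return origin + ((ts - origin) // SLOT) * SLOT


def content_bigrams(text):
    """Bigrams of two consecutive terms that are both non-stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {(a, b) for a, b in zip(tokens, tokens[1:])
            if a not in STOPWORDS and b not in STOPWORDS}


def distill(tweets, origin, top_k=3):
    """tweets: iterable of (timestamp, relevance_score, text).

    Returns {slot_start: [text, ...]} with at most top_k tweets per slot.
    """
    slots = defaultdict(list)
    for ts, score, text in tweets:
        if re.search(r"\bRT\b", text):  # drop retweets (tweets containing the RT word)
            continue
        slots[slot_start(ts, origin)].append((score, text))

    result = {}
    for start, items in slots.items():
        kept, seen = [], set()
        # Visit tweets by decreasing relevance, so the lower-ranked duplicate is dropped.
        for _score, text in sorted(items, key=lambda item: -item[0]):
            grams = content_bigrams(text)
            if grams & seen:  # shares a bigram with a higher-ranked tweet
                continue
            seen |= grams
            kept.append(text)
        result[start] = kept[:top_k]
    return result


# Hypothetical sample data, loosely modelled on the headlines in Table 1.
origin = datetime(2014, 2, 25, 18, 0, tzinfo=timezone.utc)
tweets = [
    (origin + timedelta(minutes=1), 0.9, "Putin puts troops in western Russia on alert"),
    (origin + timedelta(minutes=5), 0.7, "Alarm as Putin puts troops on alert"),
    (origin + timedelta(minutes=7), 0.5, "RT @x: Putin puts troops on alert"),
    (origin + timedelta(minutes=20), 0.4, "Bitcoin exchange Mt. Gox goes offline"),
]
top = distill(tweets, origin)
```

In this sample the second tweet is discarded because it shares the bigram (putin, puts) with a higher-ranked tweet, and the third is discarded as a retweet. Comparing each tweet against all higher-ranked tweets of the slot, including ones already discarded, is one possible reading of the rule in d); the paper does not specify this detail.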
            Figure 2: The Buzz chart shows the number of tweets of the SNOW 2014 test collection
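
Step c) of Section 4 names the Bose-Einstein and Kullback-Leibler expansion weights without stating them. For reference, the forms below are the ones commonly given in the Divergence From Randomness literature and in the Terrier documentation (Bo1 and KL). Whether the submitted run used exactly these variants is an assumption; see [3] for the authors' own formulation.

```latex
% tf_x : frequency of term t in the top-ranked (pseudo-relevant) tweets
% F    : frequency of t in the whole collection;  N : number of tweets in the collection
% P_x  : relative frequency of t in the top-ranked tweets
% P_c  : relative frequency of t in the whole collection
w_{\mathrm{Bo1}}(t) = tf_x \cdot \log_2\frac{1 + P_n}{P_n} + \log_2(1 + P_n),
\qquad P_n = \frac{F}{N}

w_{\mathrm{KL}}(t) = P_x \cdot \log_2\frac{P_x}{P_c}
```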
5    Evaluation
Since the submitted run contains just 288 records (i.e. 3 tweets times 96 time slots), we
performed a complete manual evaluation to measure the precision of our baseline. Our assessment
focused on the relevance of the tweets presented as representative of a news item.
Interestingly, we quickly realized that it is not trivial to determine what should be
classified as news. For example, should the tweet

    115 - Sergio Aguero has the best minutes per goal rate in @PremierLeague history
    scoring on average every 115 minutes. Delivered.

be considered news? To reduce the impact of subjectivity, each tweet was evaluated by three
human experts and classified as:

    • highly relevant (i.e. it is news), if all three human experts agree in considering the
      tweet representative of a news item;

    • not relevant (i.e. it is not news), if all three human experts agree in considering the
      tweet not representative of a news item;

    • relevant (i.e. it seems to be news), otherwise.

   The precision of our system (i.e. P@3) varies between 0.34, if we consider only the highly
relevant class, and 0.58, if we also consider the relevant one. It is worth noting that these
results are strongly influenced by the choice to return exactly three tweets for each time
slot. In terms of precision, this strategy can be disadvantageous when a time slot does not
contain any news. Furthermore, we tried to improve the computation of the precision value and
to get a sense of the performance in terms of recall: for each time slot we extracted all
tweets containing the term BREAKING, as a substring, from the result set used to create the
submitted run. In this case we assume:

    • if a tweet contains the substring, it is probably representative of a news item. This
      hypothesis is confirmed by the resulting precision, which is equal to 0.94;

    • if a news item emerging from a tweet containing the substring is not represented by a
      tweet in the submitted run, we missed the news item;

    • if a time slot does not contain any tweet containing the substring and all tweets in the
      submitted run in that time slot have been evaluated as “not relevant”, then the time slot
      is not considered in the computation of the precision and recall values, because we have
      no evidence of the existence of a news item to be discovered: this hypothesis simulates
      the case in which the system is able to return an empty result when a time slot does not
      contain any news. Applying this rule, 9 time slots were removed.

Considering both the highly relevant and relevant classes, we obtain a precision equal to 0.64
and a recall equal to 0.80.
   Even if we know that our assessment is incomplete, we believe this is an encouraging
starting point for the implementation of a vertical system for time-based topic detection on
Twitter. The official evaluation results of our method in the Data Challenge are included in
[10].

Table 1: An extract of records submitted to the SNOW 2014 Challenge.

Time-slot         Headline                                         Tags
----------------  -----------------------------------------------  ----------------------------------------------
26-02-2014 13:30  Mt. Gox Founder Issues New Statement, Says       bitcoin,founder,new,mt,still,gox,karpeles,mark
                  He’s Still in Japan: Mark Karpeles, founder
                  of troubled bitcoin exc...
26-02-2014 13:30  Jermain Defoe has played his last game for       last,defoe,jermain,game,dnipro,tottenham
                  Tottenham after being ruled out of tomorrows
                  match against Dnipro
26-02-2014 13:30  Putin puts troops in western Russia on alert     ukraine,russia,putin,troops
                  amid Ukraine tension: President Vladimir
                  Putin ordered an urgent ...
26-02-2014 13:45  [GMA News] Alarm in Ukraine as Putin puts        ukraine,troops,putin,russian,news,puts,alert
                  Russian troops on alert: SIMFEROPOL,
                  Ukraine/KIEV - President Vladim...
26-02-2014 13:45  Syria State Media Says Army Kills 175 Rebels:    syria,troops,state,175,media,army
                  State media in Syria says army troops have
                  killed 175 rebels in an ambush south of Dama...

Table 2: An extract of tweets containing the term BREAKING as substring.

Time-slot         Headline                                         Tags
----------------  -----------------------------------------------  ----------------------------------------------
26-02-2014 13:00  BREAKING: Reports say Vladimir Putin has         ukraine,troops,putin,russia,combat,readiness
                  ordered a test of combat readiness of troops
                  in central, western Russia.”Ukraine bound!
26-02-2014 13:30  BREAKING: State media in Syria says army         syria,troops,damascus
                  troops have killed 175 rebels in an ambush
                  south of Damascus.
26-02-2014 14:15  BREAKING: Mid Staffs NHS trust to be             -
                  dissolved, Jeremy Hunt announces

6    Conclusion and Future Work
In this paper we have described our approach to a challenging task: time-based topic
distillation from microblogs. More precisely, we have reported on the strategy adopted to
submit a preliminary baseline to the SNOW 2014 Data Challenge and on a first assessment
attempt. Starting from this baseline, we will explore the following research directions:

  a) The use of a topic-based clustering method, e.g. k-means driven by topic, or of a
     search-based result set, to further split each time slot into homogeneous clusters.

  b) The filtering of tweets by sentiment polarity. Sentiment analysis can indeed be useful to
     detect neutral tweets, since we assume that breaking news does not in general contain
     opinions or sentiment polarities, unless the news quotes other people’s statements.

  c) Freshness and tweet peak analysis improve retrieval quality [2]. The best representative
     for each time-based cluster can be further selected taking into account topic relevance,
     diversity and freshness, not just diversity and relevance as we have done with our
     baseline. Zipf’s law, other fat-tailed distributions [2], or an exponential decay function
     [7] can enhance early precision. At the moment we have not used any time-based retrieval
     function to order or select the tweets representative of the selected news.

  d) The NDD algorithm was so restrictive that only a few tweets were selected among the
     topmost relevant retrieved ones. For this reason we decided to select only a small number
     of tweets per time slot. If we had used a less aggressive near-duplicate detection method,
     for example one based on Jaccard’s coefficient instead of a simple bigram-sharing
     condition, we would have been able to produce a longer list of relevant and diverse news.
     Diversity thus requires a refinement of NDD in combination with freshness and topic
     relevance. Because of the too restrictive NDD condition between tweets, we have not
     produced the list of near-duplicate candidates for each selected tweet. For Twitter
     search, the use of min-wise independent permutations for NDD [4] can easily be handled
     with k-grams with k greater than or equal to three, even without sophisticated similarity
     functions such as Jaccard’s. In fact, due to the shortness of the messages (a tweet
     contains 13 words on average), there is a high probability that near duplicates share only
     one k-gram within a short time slot. Obviously such a tight condition would be too
     restrictive for larger collections and, more importantly, without restricting the
     comparison of near duplicates to very short periods of time. We have thus easily singled
     out duplicates not only by removing the tweets containing the RT word, but also by
     removing tweets sharing any k-gram. In order to be more selective in the initial ranking,
     we have further relaxed this condition to bigrams (which include entities such as Mark
     Karpeles, western Russia, etc. in Table 1), but at the moment we cannot evaluate the
     corresponding loss in recall.

7    Acknowledgments
Fondazione Ugo Bordoni carried out this work in collaboration with Almawave.

References

 [1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. Automatic construction of
     an opinion-term vocabulary for ad hoc retrieval. In C. Macdonald, I. Ounis, V. Plachouras,
     I. Ruthven, and R. W. White, editors, ECIR, volume 4956 of Lecture Notes in Computer
     Science, pages 89–100. Springer, 2008.

 [2] G. Amati, G. Amodeo, and C. Gaibisso. Survival analysis for freshness in microblogging
     search. In X. wen Chen, G. Lebanon, H. Wang, and M. J. Zaki, editors, CIKM, pages
     2483–2486. ACM, 2012.

 [3] G. Amodeo, G. Amati, and G. Gambosi. On relevance, time and query expansion. In
     Proceedings of the 20th ACM international conference on Information and knowledge
     management, CIKM ’11, pages 1973–1976, New York, NY, USA, 2011. ACM.

 [4] A. Z. Broder and M. Mitzenmacher. Completeness and robustness properties of min-wise
     independent permutations. Random Struct. Algorithms, 18(1):18–30, 2001.

 [5] K. Chodorow. MongoDB: The Definitive Guide. O’Reilly Media, 2013.

 [6] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging
     usage and communities. WebKDD/SNA-KDD’07, 2007.

 [7] X. Li and W. B. Croft. Time-based language models. In Proceedings of the twelfth
     international conference on Information and knowledge management, CIKM ’03, pages 469–475,
     New York, NY, USA, 2003. ACM.

 [8] I. Lunden. Mobile twitter: 164m+ (75%) access from handheld devices monthly, 65% of ad
     sales come from mobile.
     http://techcrunch.com/2013/10/03/mobile-twitter-161m-access-from-handheld-devices-each-month-65-of-ad-revenues-coming-from-mobile/.

 [9] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier
     information retrieval platform. In D. E. Losada and J. M. Fernández-Luna, editors, ECIR,
     volume 3408 of Lecture Notes in Computer Science, pages 517–519. Springer, 2005.

[10] S. Papadopoulos, D. Corney, and L. M. Aiello. Snow 2014 data challenge: Assessing the
     performance of news topic detection methods in social media. In Proceedings of the SNOW
     2014 Data Challenge, 2014.

</pre>