
Time-based Microblog Distillation

Giambattista Amati

gba@fub.it 0

Simone Angelini

sangelini@fub.it 0

Marco Bianchi

mbianchi@fub.it 0

Giorgio Gambosi

giorgio.gambosi@uniroma2.it 1

Gianluca Rossi

gianluca.rossi@uniroma2.it 1

0 Fondazione Ugo Bordoni, Rome, Italy
1 Univ. of Rome Tor Vergata, Rome, Italy

2014


This paper presents a simple approach for identifying relevant and reliable news in the Twitter stream as soon as it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni and modified to detect the most representative tweets in a specified time slot. This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter.

Microblogging is one of the most successful and widely used paradigms for communicating and interacting on online social networks. Under this paradigm, users post short messages that are publicly delivered to all their followers in real time. In particular Twitter, the most popular microblogging framework, allows users to exchange messages (tweets) of at most 140 characters. This constraint is particularly suitable for posting from mobile devices, as confirmed by statistics on user access [8].

Twitter is used as a vehicle for the prompt, epidemic diffusion of news, in terms of both announcements and comments on topics of general interest [6], though it is also largely used for conversation, chatting, or exchanging updates about user activities or location, that is, to exchange information valuable at a personal level. With its claimed 500 million tweets per day and more than 200 million active users per month (source: Initial public offering of shares of common stock of Twitter, Inc.), Twitter turns out to be a primary source of timely information. Being able to discover relevant news within the overall tweet stream as soon as it is announced is therefore an important issue both for journalists and for ordinary news readers.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.

This poses several non-trivial problems: identifying emerging topics as collections of related tweets; distinguishing news announcements from other types of information as soon as possible; determining their freshness, to gather emerging news as quickly as possible; diversifying accounts of the latest news, to avoid reporting the same information several times; and evaluating the reliability of a news announcement, also in terms of source trustworthiness.

This paper reports the results of an experiment aimed at developing a system able to effectively identify and report relevant and reliable news from the Twitter stream as soon as it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter, implemented by the Fondazione Ugo Bordoni and modified to detect the most representative tweets in a specified time slot.

This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter. To measure the effectiveness of our algorithms we joined the SNOW 2014 Data Challenge: the task defined by the organizers of this challenge is very suitable for our research purpose. It is worth noting that, even if the results of this experiment seem encouraging, we consider them just a baseline for future experiments. In fact, the effectiveness of our strategy can be improved both by a better tuning of the system parameters and by applying more advanced techniques, such as timeline analysis to deal with the freshness of tweets, sentiment analysis to detect neutrality (as expected in news announcements), and more sophisticated approaches to tweet clustering and near-duplicate detection.

The paper is organized as follows: in Section 2 we briefly introduce the SNOW 2014 Data Challenge task and the related benchmark. In Section 3 we provide an architectural overview of the system implemented by the Fondazione Ugo Bordoni for near-real-time sentiment analysis on Twitter. In Section 4 we describe our approach and in Section 5 we present the results of a preliminary evaluation of our baseline. Section 6 concludes the paper.

Task definition

The SNOW 2014 Data Challenge defines a task for real-time topic detection on Twitter. More precisely, the task consists in identifying the most relevant topics in time slots of 15 minutes in the period between 25-02-14 (18:00 GMT) and 26-02-14 (18:00 GMT).

The test data used in the SNOW 2014 Data Challenge consists of about one million tweets¹ from the Twitter stream. The filtering was carried out using the Twitter Streaming API. Tweets were selected by monitoring four keywords (i.e. Syria, terror, Ukraine, and bitcoin) and about 5000 user accounts. Since the monitoring spanned 24 hours, the total number of analyzed time slots was 96. For each time slot and each discovered topic, a short headline should be produced, together with a set of representative tweets, possibly URLs of pictures, and finally a set of keywords. The expected output format is the following:

time-slot headline keywords tweetIds pictureUrl
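The slot arithmetic described above can be illustrated with a minimal Python sketch. The names `slot_index` and `slot_label` are hypothetical, and labelling a slot by its start time is an assumption; the challenge may identify slots differently.

```python
from datetime import datetime, timedelta, timezone

# Monitoring window and slot width as stated in the task definition.
START = datetime(2014, 2, 25, 18, 0, tzinfo=timezone.utc)
END = datetime(2014, 2, 26, 18, 0, tzinfo=timezone.utc)
SLOT = timedelta(minutes=15)

# Dividing one timedelta by another with // yields an integer.
NUM_SLOTS = (END - START) // SLOT


def slot_index(created_at):
    """Return the 0-based slot of a tweet timestamp, or None if outside the window."""
    if not (START <= created_at < END):
        return None
    return (created_at - START) // SLOT


def slot_label(index):
    """Label a slot by its start time (assumed convention), e.g. '26-02-2014 13:30'."""
    return (START + index * SLOT).strftime("%d-%m-%Y %H:%M")
```

For instance, a tweet posted at 13:37 GMT on 26-02-2014 falls in the slot starting at 13:30, and the whole 24-hour window contains 96 slots.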

For the SNOW 2014 Challenge task we provided the following outcome: instead of a headline summarizing the discovered topic, we return the most representative tweet for that topic, and we report its tweetId in the tweetIds field.
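As a rough illustration of how such a record could be assembled, the following Python sketch fills the five fields of the output format from the top-ranked tweet. The names `TopicRecord`, `to_line` and `record_from_top_tweet` are hypothetical, and the tab separator, the comma-joined lists and the empty picture field are assumptions: the source does not specify how fields are delimited.

```python
from typing import NamedTuple, Sequence


class TopicRecord(NamedTuple):
    """One output record: time-slot, headline, keywords, tweetIds, pictureUrl."""
    time_slot: str
    headline: str
    keywords: Sequence[str]
    tweet_ids: Sequence[str]
    picture_url: str = ""

    def to_line(self, sep="\t"):
        # Assumed serialization: fields separated by `sep`, lists joined by commas.
        return sep.join([
            self.time_slot,
            self.headline,
            ",".join(self.keywords),
            ",".join(self.tweet_ids),
            self.picture_url,
        ])


def record_from_top_tweet(time_slot, tweet_id, tweet_text, keywords):
    """Use the most representative tweet as the headline and as the only tweetId."""
    return TopicRecord(time_slot, tweet_text, list(keywords), [tweet_id])


# Invented example values, for illustration only.
example = record_from_top_tweet(
    "26-02-2014 13:30", "123", "Putin puts troops on alert",
    ["ukraine", "troops", "putin"],
)
```

A real implementation would also need to escape any separator characters occurring inside the tweet text.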

¹ While the SNOW 2014 Data Challenge organizers collected 1,041,062 tweets, we filtered 1,040,362 tweets. In any case the difference, in the order of 0.067%, is not statistically significant.

System description

The experiment was conducted using a system for near-real-time sentiment analysis on Twitter. This system, developed by the Fondazione Ugo Bordoni, is based on the Terrier framework [9]. Figure 1 presents a high-level architectural overview of the system.

The Twitter stream is filtered by Twitter Connectors, which are software components using the free Twitter Streaming API. As specified by the Twitter Streaming API specification, each connector can define a filter composed of at most 400 keywords and 5000 user accounts. Since the API is used for free, the service provided by Twitter works in a best-effort fashion: as a consequence, if a filter is too noisy (i.e. the number of tweets matching the monitored keywords is too high), Twitter does not guarantee the delivery of all tweets matching the conditions defined by the connector. All tweets collected by the connectors are stored in a distributed installation of MongoDB [5]. Since the platform is mainly oriented to implementing the sentiment analysis solution described in [1], the system includes a Web application for the manual annotation of tweets and a software component (i.e. Sentiment Analysis Dictionary Builder) for the automatic generation of dictionaries containing weighted opinion-bearing terms. Dictionaries are used by an extended version of Terrier, specifically implemented to support the indexing of tweets and to enable time-based mining activities on the indexed collection. The front end of the system is a Web application implementing several tools to perform time-based searches (e.g. search for relevance, search for freshness, search for opinions), to discover latent concepts related to a specified topic, to produce charts, and so on. Figure 2 shows the Buzz Chart produced by the Web application for the SNOW 2014 test collection.
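The per-connector limits mentioned above suggest that a large set of keywords or accounts has to be spread over several connectors. The following Python sketch shows one possible way to do this; `ConnectorFilter` and `plan_connectors` are hypothetical names, no call to the Twitter API is made, and the actual partitioning used by the system is not described in the source.

```python
from dataclasses import dataclass, field

MAX_KEYWORDS = 400   # per-connector keyword limit stated in the text
MAX_ACCOUNTS = 5000  # per-connector account limit stated in the text


@dataclass
class ConnectorFilter:
    """Filter of one connector; rejects configurations above the stated limits."""
    keywords: list = field(default_factory=list)
    accounts: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.keywords) > MAX_KEYWORDS:
            raise ValueError("too many keywords for a single connector")
        if len(self.accounts) > MAX_ACCOUNTS:
            raise ValueError("too many accounts for a single connector")


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_connectors(keywords, accounts):
    """Distribute keywords and accounts over as few connectors as the limits allow."""
    kw_chunks = chunks(list(keywords), MAX_KEYWORDS)
    acc_chunks = chunks(list(accounts), MAX_ACCOUNTS)
    filters = []
    for i in range(max(len(kw_chunks), len(acc_chunks))):
        filters.append(ConnectorFilter(
            keywords=kw_chunks[i] if i < len(kw_chunks) else [],
            accounts=acc_chunks[i] if i < len(acc_chunks) else [],
        ))
    return filters
```

With the four keywords and roughly 5000 accounts of the SNOW 2014 collection, a single filter would suffice under these limits; 900 keywords would require three.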

This system was used to take part in the SNOW 2014 Challenge by simply submitting an "empty" query for the desired time slots and selecting sorting by relevance. The system automatically retrieves the relevant tweets and the representative weighted words for those time slots. In the following section we detail our approach to time-based topic distillation.

Experimentation

We have simulated a time-based distillation of tweets from the Twitter stream, assuming that the test collection is unbiased by the filtering keywords, although only a very limited number of keywords (i.e. Syria, terror, Ukraine, and bitcoin) were used to filter Twitter's firehose. In fact, due to this limited number of keywords, the collection cannot be considered an unbiased sample of [...]

[Figure 1: High-level architectural overview of the system. Components shown: Twitter Connectors (tweets containing key sets n. 1 to n. 4), MongoDB, Extended Terrier, Dictionaries, Analytic Web Tools, Web Application for manual annotation, Annotated Tweets, Sentiment Analysis Dictionary Builder.]

a) We have assumed that we are processing an unbiased stream. We have gathered all tweets into time slots of 15 minutes. Thus, we have not searched for tweets using the four original topics, but have filtered the results just by time.

b) We have used a very fast English-based filter.

c) A [...]

d) We have used a very light and fast near-duplicate detection (NDD) algorithm to remove tweets from the second-pass retrieved set. In particular, two tweets are considered near duplicates if they share a bigram of two consecutive non-stopword terms. The near-duplicate tweet lower in the ranking was eliminated. We finally presented the first three tweets per time slot.

Since the submitted run contains just 288 records (i.e. 3 tweets times 96 time slots), we performed a complete manual evaluation in order to measure the precision of our baseline. Our assessment focused on the relevance of the tweets presented as representative of a news item. Interestingly, we quickly realized that it is not trivial to determine what should be classified as news. For example, should the following tweet be considered news?

115 - Sergio Aguero has the best minutes per goal rate in @PremierLeague history scoring on average every 115 minutes. Delivered.

To reduce the impact of subjectivity, each tweet was evaluated by three human experts and classified as: highly relevant (i.e. it is news), if all three human experts agree in considering the tweet representative of a news item; not relevant (i.e. it is not news), if all three human experts agree in considering the tweet not representative of a news item; relevant (i.e. it seems to be news), otherwise.

The precision of our system (i.e. P@3) varies between 0.34, if we just consider the highly relevant class, and 0.58, if we also consider the relevant one. It is worth noting that these results are strongly influenced by the choice to return exactly three tweets for each time slot. In terms of precision, this strategy can be disadvantageous when a time slot does not contain any news.

[...]

if a tweet contains the substring [...], it is probably representative of a news item. This hypothesis is confirmed by the resulting precision, which is equal to 0.94;
if a news item emerging from a tweet containing the substring is not represented by a tweet in the submitted run, we missed the news;
if a time slot does not contain any tweet containing the substring and all tweets in the submitted run in that time slot have been evaluated as "not relevant", then the time slot is not considered in the computation of precision and recall values, because we do not have any evidence of the existence of a news item to be discovered: this hypothesis simulates the case in which the system is able to return an empty result when a time slot does not contain any news.

Applying this rule, 9 time slots were removed. Considering both the highly relevant and relevant classes, we obtain a precision equal to 0.64 and a recall equal to 0.80.

Even if we know we performed an incomplete assessment, we believe this is an encouraging starting point for the implementation of a vertical system for time-based topic detection on Twitter. The official evalua[...]
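The three-assessor labelling scheme and the two precision figures (strict and lenient) can be sketched in Python as follows. The function names and the votes are invented for illustration and are not the actual assessment data.

```python
def classify(votes):
    """Map three assessor votes (True = 'this tweet is news') to a relevance class."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    if all(votes):
        return "highly relevant"   # unanimous yes
    if not any(votes):
        return "not relevant"      # unanimous no
    return "relevant"              # assessors disagree


def precision(labels, accepted):
    """Fraction of returned tweets whose class belongs to `accepted`."""
    return sum(label in accepted for label in labels) / len(labels)


# Invented votes for the three tweets returned in one hypothetical time slot.
votes_per_tweet = [(True, True, True), (True, False, True), (False, False, False)]
labels = [classify(v) for v in votes_per_tweet]

strict = precision(labels, {"highly relevant"})
lenient = precision(labels, {"highly relevant", "relevant"})
```

Applying the same computation to all 288 submitted records, rather than to one slot, would correspond to the strict and lenient P@3 values reported above.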

[Table 1. Time slots: 26-02-2014 13:30, 26-02-2014 13:30, 26-02-2014 13:30, 26-02-2014 13:45, 26-02-2014 13:45. Keyword sets: last,defoe,jermain,game,dnipro,tottenham; ukraine,russia,putin,troops; ukraine,troops,putin,russian,news,puts,alert; syria,troops,state,175,media,army; bitcoin,founder,new,mt,still,gox,karpeles,mark.]

In this paper we describe our approach to a challenging task: time-based topic distillation from microblogs. More precisely, we report the strategy adopted to submit a preliminary baseline to the SNOW 2014 Data Challenge, together with a first assessment attempt. Starting from this baseline, we will explore the following research directions:

a) The use of a topic-based clustering method, e.g. k-means driven by topic, or of a search-based result set, to further split each time slot into homogeneous clusters.

b) The filtering of tweets by sentiment polarity. Sentiment analysis can indeed be useful to detect neutral tweets, since we assume that breaking news does not in general contain opinions or sentiment polarities, unless the news quotes other people's statements.

c) Freshness and tweet peak analysis, which improve retrieval quality [2]. The best representative for each time-based cluster can be further selected taking into account topic relevance, diversity and freshness, not just diversity and relevance as we have done with our baseline. Zipf's law, other fat-tailed distributions [2], or an exponential decay function [7] can enhance early precision. At the moment we have not used any time-based retrieval function to order or select the tweet representatives of the selected news.

d) The NDD algorithm was so restrictive that only a few tweets were selected among the topmost relevant retrieved ones. For this reason we decided to select only a small number of tweets for each time slot. If we had used a less aggressive near-duplicate detection method, for example one based on Jaccard's coefficient instead of a simple bigram-sharing condition, we could have produced a longer list of relevant and diverse news. Diversity thus requires a refinement of NDD in combination with freshness and topic relevance. Because the NDD condition between tweets was too restrictive, we have not produced the list of near-duplicate candidates for each selected tweet. The use of min-wise independent permutations for NDD [4] in Twitter search can be easily handled with k-grams with k greater than or equal to three, even without sophisticated similarity functions such as Jaccard's. In fact, due to the shortness of messages (a tweet contains 13 words on average), there is a high probability that tweets sharing even a single k-gram within a short time slot are near duplicates. Obviously such a tight condition would be too restrictive for larger collections and, more importantly, when near duplicates are not confined to very short periods of time. We have thus easily singled out duplicates not only by removing the tweets containing the RT word, but also by removing tweets sharing any k-gram. In order to be more selective in the initial ranking, we have further relaxed this condition to bigrams (which include entities such as Mark Karpeles, western Russia, etc. in Table 1), but at the moment we cannot evaluate the corresponding loss in recall.
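To make the contrast between the two near-duplicate conditions concrete, the following Python sketch implements both the shared-bigram rule and a Jaccard-based alternative, and applies them to a ranked list so that the lower-ranked duplicate is dropped. It is only an illustration under stated assumptions: the stopword list, the tokenizer, the 0.5 threshold and the example tweets are all invented, and bigrams are computed after stopword removal, which is one possible reading of "consecutive non-stopword terms".

```python
import re

# Hypothetical, deliberately tiny stopword list; the source does not name the one used.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "for", "at", "by"}


def content_terms(text):
    """Lowercase, keep alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def kgrams(text, k=2):
    """Set of k-grams of consecutive non-stopword terms (assumed interpretation)."""
    terms = content_terms(text)
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}


def shares_kgram(a, b, k=2):
    """Strict condition: near duplicates if at least one k-gram is shared."""
    return bool(kgrams(a, k) & kgrams(b, k))


def jaccard(a, b, k=2):
    """Jaccard coefficient between the k-gram sets of two tweets."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def deduplicate(ranked, is_duplicate):
    """Scan the ranking top-down, dropping tweets that duplicate an already kept one."""
    kept = []
    for tweet in ranked:
        if not any(is_duplicate(tweet, other) for other in kept):
            kept.append(tweet)
    return kept


# Invented example tweets, already sorted by relevance.
ranked = [
    "Putin puts troops in western Russia on alert",
    "Ukraine crisis: Putin puts troops on alert in drill",
    "Bitcoin exchange Mt Gox goes offline",
]

strict_result = deduplicate(ranked, shares_kgram)
relaxed_result = deduplicate(ranked, lambda a, b: jaccard(a, b) >= 0.5)
```

In this toy ranking the shared-bigram rule discards the second tweet, whereas the Jaccard rule with the assumed threshold keeps all three, which illustrates why a less aggressive condition could yield a longer and more diverse list.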

Acknowledgments

Fondazione Ugo Bordoni carried out this work in collaboration with Almawave.

[1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. Automatic construction of an opinion-term vocabulary for ad hoc retrieval. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, and R. W. White, editors, ECIR, volume 4956 of Lecture Notes in Computer Science, pages 89-100. Springer, 2008.

[2] G. Amati, G. Amodeo, and C. Gaibisso. Survival analysis for freshness in microblogging search. In X. wen Chen, G. Lebanon, H. Wang, and M. J. Zaki, editors, CIKM, pages 2483-2486. ACM, 2012.

[3] G. Amodeo, G. Amati, and G. Gambosi. On relevance, time and query expansion. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pages 1973-1976, New York, NY, USA, 2011.

[4] A. Z. Broder and M. Mitzenmacher. Completeness and robustness properties of min-wise independent permutations. Random Struct. Algorithms, 18(1):18-30, 2001.

[5] K. Chodorow. MongoDB: The Definitive Guide. O'Reilly Media, 2013.

[6] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. WebKDD/SNA-KDD'07, 2007.

[7] X. Li and W. B. Croft. Time-based language models. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM '03, pages 469-475, New York, NY, USA, 2003. ACM.

[8] I. Lunden. Mobile twitter: 164m+ (75%) access from handheld devices monthly, 65% of ad sales come from mobile. http://techcrunch.com/2013/10/03/mobiletwitter-161m-access-from-handheld-devices-eachmonth-65-of-ad-revenues-coming-from-mobile/.

[9] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier information retrieval platform. In D. E. Losada and J. M. Fernandez-Luna, editors, ECIR, volume 3408 of Lecture Notes in Computer Science, pages 517-519. Springer, 2005.

[10] S. Papadopoulos, D. Corney, and L. M. Aiello. SNOW 2014 data challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.