=Paper=
{{Paper
|id=Vol-1150/amati
|storemode=property
|title=Time-based Microblog Distillation
|pdfUrl=https://ceur-ws.org/Vol-1150/amati.pdf
|volume=Vol-1150
|dblpUrl=https://dblp.org/rec/conf/www/AmatiA0GR14
}}
==Time-based Microblog Distillation==
Giambattista Amati (gba@fub.it), Simone Angelini (sangelini@fub.it), Marco Bianchi (mbianchi@fub.it): Fondazione Ugo Bordoni, Rome, Italy

Giorgio Gambosi (giorgio.gambosi@uniroma2.it), Gianluca Rossi (gianluca.rossi@uniroma2.it): Univ. of Rome Tor Vergata, Rome, Italy

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

===Abstract===
This paper presents a simple approach for identifying relevant and reliable news from the Twitter stream as soon as it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni and modified in order to detect the most representative tweets in a specified time slot. This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter.

===1 Introduction===
Microblogging is one of the most successful and widely used paradigms for communicating and interacting on online social networks. According to this paradigm, users can post short messages that are publicly delivered to all their followers in real time. In particular Twitter, the most popular microblogging framework, allows users to exchange messages (tweets) of at most 140 characters. This constraint is particularly suitable for posting from mobile devices, as confirmed by statistics on user access [8].

Twitter is used as a vehicle for the prompt, epidemic diffusion of news, in terms of both announcements and comments on topics of general interest [6], though it is also largely used for conversation, chatting or exchanging updates about user activities or location, that is, to exchange information valuable at a personal level. With its claimed 500 million tweets per day and more than 200 million active users per month (source: Initial public offering of shares of common stock of Twitter, Inc.), Twitter turns out to be a primary source of timely information. Being able to discover relevant news within the overall tweet stream as soon as it is announced is an important issue both for journalists and for ordinary news readers.

This poses several non-trivial problems: identifying emerging topics as collections of related tweets, distinguishing news announcements from other types of information as soon as possible, determining their freshness to gather emerging news as quickly as possible, diversifying accounts of the latest news to avoid reporting the same information several times, and evaluating the reliability of the news announcement also in terms of source trustworthiness.

This paper reports the results of an experimentation aimed at developing a system able to effectively identify and report relevant and reliable news from the Twitter stream as soon as it emerges. The approach is based on a near-real-time system for sentiment analysis on Twitter, implemented by the Fondazione Ugo Bordoni and modified in order to detect the most representative tweets in a specified time slot.

This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter. To measure the effectiveness of our algorithms we have joined the SNOW 2014 Data Challenge: the task defined by the organizers of this challenge is very suitable for our research purpose. It is worth noting that, even if the results of this experimentation seem encouraging, we consider them just a baseline for future experimentations. In fact, the effectiveness of our strategy can be improved both by a better tuning of the system parameters and by applying more advanced techniques, such as: timeline analysis to deal with the freshness of tweets; sentiment analysis to detect neutrality, as expected in news announcements; more sophisticated approaches for tweet clustering and near-duplicate detection.

The paper is organized as follows: in Section 2 we briefly introduce the SNOW 2014 Data Challenge task and the related benchmark. In Section 3 we provide an architectural overview of the system implemented by the Fondazione Ugo Bordoni for near-real-time sentiment analysis on Twitter. In Section 4 we describe our approach and in Section 5 we present the results of a preliminary evaluation of our baseline. Section 6 concludes the paper.

===2 Task definition===
The SNOW 2014 Data Challenge defines a task for real-time topic detection on Twitter. More precisely, the task consists in identifying the most relevant topics in time slots of 15 minutes in the period between 25-02-14 (18:00 GMT) and 26-02-14 (18:00 GMT).

The test data used in the SNOW 2014 Data Challenge consists of about one million tweets<sup>1</sup> from the Twitter Stream. The filtering activity has been conducted by using the Twitter Streaming API. Tweets have been selected by monitoring four keywords (i.e. Syria, terror, Ukraine, and bitcoin) and about 5000 user accounts. Since the monitoring spanned 24 hours, the total number of analyzed time slots was 96. For each time slot and each discovered topic, a short headline should be yielded, together with a set of representative tweets, possibly URLs of pictures, and finally a set of keywords. The expected output format is the following:

: time-slot headline keywords tweetIds pictureUrl

With respect to the SNOW 2014 Challenge task, we fulfilled the task by providing the following outcomes: instead of a headline summarizing the discovered topic, we return the most representative tweet for that topic, and we present its tweetId as the representative tweet in the tweetIds field.

<sup>1</sup> While the SNOW 2014 Data Challenge organizers collected 1,041,062 tweets, we filtered 1,040,362 tweets. Anyway the difference, in the order of 0.067%, is not statistically significant.

===3 System description===
The experimentation has been conducted by using a system for near-real-time sentiment analysis on Twitter. This system, developed by the Fondazione Ugo Bordoni, is based on the Terrier framework [9]. Figure 1 presents a high-level architectural overview of the system.

Figure 1: Architectural overview of our system for near-realtime sentiment analysis on Twitter. Components shown: Twitter Connectors (tweets containing keys set n. 1 to n. 4), MongoDB, Web Application for manual annotation, Annotated Tweets, Sentiment Analysis Dictionary Builder, Dictionaries, Extended Terrier, Analytic Web Tools.

The Twitter Stream is filtered by Twitter Connectors, which are software components using the free Twitter Streaming API. As specified by the Twitter Streaming API Specification, each connector can define a filter composed of at most 400 keywords and 5000 user accounts. Since the API is used free of charge, the service provided by Twitter works in a best-effort fashion: as a consequence, if a filter is too noisy (i.e. the number of tweets matching the monitored keywords is too high), Twitter does not guarantee the delivery of all tweets matching the conditions defined by the connector. All tweets collected by connectors are stored in a distributed installation of MongoDB [5].

Since the platform is mainly oriented to implementing the sentiment analysis solution described in [1], the system includes a Web application for the manual annotation of tweets and a software component (i.e. the Sentiment Analysis Dictionary Builder) for the automatic generation of Dictionaries containing weighted opinion-bearing terms. Dictionaries are used by an extended version of Terrier, specifically implemented to support the indexing of tweets and to enable time-based mining activities on the indexed collection. The front-end of the system is provided by a Web application implementing several tools useful to perform time-based searches (e.g. search for relevance, search for freshness, search for opinions), to discover latent concepts related to a specified topic, to provide charts, and so on. Figure 2 shows the Buzz Chart produced by the Web application with respect to the SNOW 2014 test collection.

Figure 2: The Buzz chart shows the number of tweets of the SNOW 2014 test collection.

This system has been used to take part in the SNOW 2014 Challenge, simply submitting an "empty" query with respect to the desired time slots and setting the relevance sorting. The system automatically retrieves relevant tweets and representative weighted words for those time slots. In the following Section we detail our approach for the time-based topic distillation.

===4 Experimentation===
We have simulated a time-based distillation of tweets from the Twitter stream assuming that the test collection is unbiased by filtering keywords, although a very limited number of keywords were used to filter Twitter's firehose (e.g. Syria, terror, Ukraine, and bitcoin). In fact, due to this limited number of keywords, the collection cannot be considered an unbiased sample of Twitter's firehose (about 30% of the tweets in the collection contain the above-mentioned terms). As a consequence, any methodology exploiting the knowledge of these keywords could have been easily affected by overfitting. For a similar reason, and for the sake of generality, we have neither used the list of seed authors to filter the news nor adopted ad hoc solutions for this type of task, such as the use of specific or clue keywords to detect breaking news (e.g. the word BREAKING at the beginning of the tweets) or the ids of the seed authors. However, we have deliberately removed retweets from the retrieved set because this was an explicit requirement of the task.

Since we did not have criteria or an evaluation measure to assess the quality of the result set, the values of the parameters were either the defaults provided by our system or set by empirical evidence. The training set was thus only used to provide a qualitative evaluation of the distilled result set of tweets, and it was not used to tune the parameters of the filtering system.

We have submitted a run that will be used as a benchmark to evaluate our future experiments, on the basis of the evaluation measures that will be provided by the organizers [10]. For this reason we consider the submitted run just as a baseline for future experimentations.

To submit our baseline we have tackled the following issues:

a) We have assumed that we process an unbiased stream. We have gathered all tweets into time slots of 15 minutes. Thus, we have not searched tweets by using the four original topics, but we have filtered the results just by time.

b) We have used a very fast English-based filter. A set of common English terms was submitted as an unweighted query to the system and searched against the inverted index in order to produce a first pass retrieval. This lexicon was used to eliminate non-English tweets from the stream. We have used a stopword list of 453 English words as a query to filter tweets written in English, thus reducing the collection to 94.10% of its original size. The error rate of non-English tweets after retrieval was 9.03% in the sample of the list submitted to the SNOW competition. We do not yet have statistics on the error rate for the false negatives in the not-retrieved set.

c) Though we have not used a query for the first pass retrieval, we have ranked the tweets of each time slot by relevance using a query expansion technique. We have applied the Bose-Einstein query expansion weights to determine the new term queries. The Bose-Einstein (BO) weight is a variant of the Kullback-Leibler divergence (KL) and is preferred to KL when recall is more important than early precision, as required in our case by the absence of a topic-based first pass retrieval [3].

d) We have used a very light and fast near-duplicate detection (NDD) algorithm to remove tweets from the second pass retrieved set. In particular, two tweets are considered near-duplicates if they share a bigram of two consecutive non-stopword terms. The near-duplicate tweet lower in the ranking was eliminated. We finally presented the first three tweets per time slot.

===5 Evaluation===
Since the submitted run contains just 288 records (i.e. 3 tweets times 96 time slots), we performed a complete manual evaluation in order to evaluate the precision of our baseline. Our assessment focused on the relevance of tweets presented as representative of a news item. Interestingly, we quickly realized it is not trivial to determine what should be classified as news. For example, should the following tweet be considered news?

: 115 - Sergio Aguero has the best minutes per goal rate in @PremierLeague history scoring on average every 115 minutes. Delivered.

To reduce the impact of subjectivity, each tweet has been evaluated by three human experts and classified as:

* highly relevant (i.e. it is news), if all three human experts agree in considering the tweet as representative of a news item;
* not relevant (i.e. it is not news), if all three human experts agree in considering the tweet as not representative of a news item;
* relevant (i.e. it seems to be news), otherwise.

The precision of our system (i.e. P@3) varies between 0.34, if we just consider the highly relevant class, and 0.58, if we also consider the relevant one. It is worth noting that these results are strongly influenced by the choice to return exactly three tweets for each time slot. In terms of precision, this strategy can be disadvantageous when a time slot does not contain any news. Furthermore, we tried to improve the computation of the precision value and to get a sense of the performance in terms of recall: for each time slot we have extracted all tweets containing the term BREAKING, as a substring, from the result set used to create the submitted run. In this case we assume:

* if a tweet contains the substring, it is probably representative of a news item. This hypothesis is confirmed by the resulting precision, which is equal to 0.94;
* if a news item emerging from a tweet containing the substring is not represented by a tweet in the submitted run, we missed the news item;
* if a time slot does not contain any tweet containing the substring and all tweets in the submitted run in that time slot have been evaluated as "not relevant", then the time slot is not considered in the computation of precision and recall values, because we do not have any evidence of the existence of news to be discovered: this hypothesis simulates the case in which the system is able to return an empty result when a time slot does not contain any news. Applying this rule, 9 time slots were removed.

Considering both the highly relevant and relevant classes, we obtain a precision equal to 0.64 and a recall equal to 0.80.

Even if we know we performed an incomplete assessment, we believe this is an encouraging starting point for the implementation of a vertical system for time-based topic detection on Twitter. The official evaluation results of our method in the Data Challenge are included in [10].

{| class="wikitable"
|+ Table 1: An extract of records submitted to the SNOW 2014 Challenge.
! Time-slot !! Headline !! Tags
|-
| 26-02-2014 13:30 || Mt. Gox Founder Issues New Statement, Says He’s Still in Japan: Mark Karpeles, founder of troubled bitcoin exc... || bitcoin,founder,new,mt,still,gox,karpeles,mark
|-
| 26-02-2014 13:30 || Jermain Defoe has played his last game for Tottenham after being ruled out of tomorrows match against Dnipro || last,defoe,jermain,game,dnipro,tottenham
|-
| 26-02-2014 13:30 || Putin puts troops in western Russia on alert amid Ukraine tension: President Vladimir Putin ordered an urgent ... || ukraine,russia,putin,troops
|-
| 26-02-2014 13:45 || [GMA News] Alarm in Ukraine as Putin puts Russian troops on alert: SIMFEROPOL, Ukraine/KIEV - President Vladim... || ukraine,troops,putin,russian,news,puts,alert
|-
| 26-02-2014 13:45 || Syria State Media Says Army Kills 175 Rebels: State media in Syria says army troops have killed 175 rebels in an ambush south of Dama... || syria,troops,state,175,media,army
|}

{| class="wikitable"
|+ Table 2: An extract of tweets containing the term BREAKING as substring.
! Time-slot !! Headline !! Tags
|-
| 26-02-2014 13:00 || BREAKING: Reports say Vladimir Putin has ordered a test of combat readiness of troops in central, western Russia.”Ukraine bound! || ukraine,troops,putin,russia,combat,readiness
|-
| 26-02-2014 13:30 || BREAKING: State media in Syria says army troops have killed 175 rebels in an ambush south of Damascus. || syria,troops,damascus
|-
| 26-02-2014 14:15 || BREAKING: Mid Staffs NHS trust to be dissolved, Jeremy Hunt announces || -
|}

===6 Conclusion and Future Work===
In this paper we describe our approach in facing a challenging task: the time-based topic distillation from microblogs. More precisely, we report on the strategy adopted to submit a preliminary baseline to the SNOW 2014 Data Challenge and on a first assessment attempt. Starting from this baseline, we will explore the following research directions:

a) The use of a topic-based clustering method, e.g. k-means driven by topic, or of a search-based result set to further split each time slot into homogeneous clusters.

b) The filtering of tweets by sentiment polarity. Sentiment analysis can indeed be useful to detect neutral tweets, since we assume that breaking news does not in general contain opinions or sentiment polarities, unless the news quotes other people's statements.

c) Freshness and tweet peak analysis improve retrieval quality [2]. The best representative for each time-based cluster can be further selected taking into account topic relevance, diversity and freshness, not just diversity and relevance as we have done with our baseline. Zipf's law, other fat-tailed distributions [2], or an exponential decay function [7] can enhance early precision. At the moment we have not used any time-based retrieval function to order or select the tweets representative of the selected news.

d) The NDD algorithm was so restrictive that only a few tweets were selected among the topmost relevant retrieved ones. For this reason we have decided to select only a small number of tweets for each time slot. If we had used a less aggressive near-duplicate detection method, for example with Jaccard's coefficient instead of a simple bigram sharing condition, then we would have been able to produce a longer list of relevant and diverse news. Diversity thus requires a refinement of NDD in combination with freshness and topic relevance. Because of the too restrictive NDD condition between tweets, we have not produced the list of near-duplicate candidates for each selected tweet. The use of min-wise independent permutations for NDD [4] for Twitter search can be easily handled with the use of k-grams with k greater than or equal to three, even without the use of sophisticated similarity functions such as Jaccard's. In fact, due to the shortness of messages (a tweet contains 13 words on average), there is a high probability that near duplicates share only one k-gram within a short time slot. Obviously such a tight condition would be too restrictive for larger collections and, more importantly, without restricting near duplicates to very short periods of time. We have thus easily singled out duplicates not only by removing the tweets containing the RT word, but also by removing tweets sharing any k-gram. In order to be more selective in the initial ranking, we have further relaxed this condition to bigrams (which include entities such as Mark Karpeles, western Russia etc. in Table 1), but at the moment we cannot evaluate the corresponding loss in recall.

===7 Acknowledgments===
Fondazione Ugo Bordoni carried out this work in collaboration with Almawave.

===References===
[1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. Automatic construction of an opinion-term vocabulary for ad hoc retrieval. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, and R. W. White, editors, ECIR, volume 4956 of Lecture Notes in Computer Science, pages 89–100. Springer, 2008.

[2] G. Amati, G. Amodeo, and C. Gaibisso. Survival analysis for freshness in microblogging search. In X.-w. Chen, G. Lebanon, H. Wang, and M. J. Zaki, editors, CIKM, pages 2483–2486. ACM, 2012.

[3] G. Amodeo, G. Amati, and G. Gambosi. On relevance, time and query expansion. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM ’11, pages 1973–1976, New York, NY, USA, 2011. ACM.

[4] A. Z. Broder and M. Mitzenmacher. Completeness and robustness properties of min-wise independent permutations. Random Struct. Algorithms, 18(1):18–30, 2001.

[5] K. Chodorow. MongoDB: The Definitive Guide. O’Reilly Media, 2013.

[6] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. WebKDD/SNA-KDD’07, 2007.

[7] X. Li and W. B. Croft. Time-based language models. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 469–475, New York, NY, USA, 2003. ACM.

[8] I. Lunden. Mobile twitter: 164m+ (75%) access from handheld devices monthly, 65% of ad sales come from mobile. http://techcrunch.com/2013/10/03/mobile-twitter-161m-access-from-handheld-devices-each-month-65-of-ad-revenues-coming-from-mobile/.

[9] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier information retrieval platform. In D. E. Losada and J. M. Fernández-Luna, editors, ECIR, volume 3408 of Lecture Notes in Computer Science, pages 517–519. Springer, 2005.

[10] S. Papadopoulos, D. Corney, and L. M. Aiello. Snow 2014 data challenge: Assessing the performance of news topic detection methods in social media. In Proceedings of the SNOW 2014 Data Challenge, 2014.
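
The baseline steps listed in Section 4 (15-minute slotting, retweet removal, stopword-based English filtering, bigram near-duplicate detection, and selection of the first three tweets per slot) can be illustrated with the following minimal Python sketch. It is not the Terrier-based implementation used for the submitted run: the function names, the tiny stopword set (the run used 453 words), the `min_hits` threshold, and the precomputed relevance `score` that stands in for the Bose-Einstein ranking are all assumptions made for the example. The sketch also reads "bigram of two consecutive non-stopword terms" as two adjacent tokens, neither of which is a stopword.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Illustrative stopword set only; the paper's list has 453 English words.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and",
             "for", "on", "has", "have", "by", "as"}
PUNCT = ".,:;!?\"'()[]"


def tokens(text):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


def slot_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute slot."""
    return ts - timedelta(minutes=ts.minute % 15,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def is_retweet(text):
    """Assumed retweet test: the tweet contains the standalone word RT."""
    return any(w.strip(PUNCT) == "RT" for w in text.split())


def looks_english(text, min_hits=1):
    """Crude language filter: at least min_hits English stopwords occur."""
    return sum(1 for w in tokens(text) if w in STOPWORDS) >= min_hits


def content_bigrams(text):
    """Adjacent token pairs in which neither token is a stopword."""
    toks = tokens(text)
    return {(a, b) for a, b in zip(toks, toks[1:])
            if a not in STOPWORDS and b not in STOPWORDS}


def distill(tweets, k=3):
    """tweets: iterable of (timestamp, score, text). Returns {slot: [texts]}."""
    by_slot = defaultdict(list)
    for ts, score, text in tweets:
        if is_retweet(text) or not looks_english(text):
            continue
        by_slot[slot_start(ts)].append((score, text))

    result = {}
    for slot, items in by_slot.items():
        seen, kept = set(), []
        # Visit tweets from most to least relevant; a tweet sharing a bigram
        # with an already kept (higher-ranked) tweet is dropped.
        for score, text in sorted(items, key=lambda x: x[0], reverse=True):
            grams = content_bigrams(text)
            if grams & seen:
                continue
            seen |= grams
            kept.append(text)
            if len(kept) == k:
                break
        result[slot] = kept
    return result
```

Calling `distill` on a list of `(timestamp, score, text)` triples returns, for each 15-minute slot, at most `k` tweets in decreasing score order. Replacing the `grams & seen` test with a Jaccard threshold over k-gram sets would give the less aggressive variant discussed in Section 6.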