TwiBiNG: A Bipartite News Generator Using Twitter Yashvardhan Sharma Divyansh Bhatia Department of Computer Science Department of Computer Science Birla Institute of Technology & Science Birla Institute of Technology & Science Pilani, India 333 031 Pilani, India 333 031 yash@pilani.bits-pilani.ac.in h2009399@pilani.bits-pilani.ac.in Vivek Kishore Choudhary Department of Computer Science Birla Institute of Technology & Science Pilani, India 333 031 f2012650@pilani.bits-pilani.ac.in public. With the advent of Web 2.0 most of the jour- nalism has gone the online way innovating the term Abstract ”Online Journalism”. Since users of the web are ready to share each and every activity they do in their lives Online Journalism is being seen as future of due to the free nature of the world, this has made Journalism. News Professionals are vying to professionals content hungry. Twitter generates an capture newsworthy stories that emerge from amount of information that can outrun the storage crowd. Live Social Media especially Twitter space of many servers in a few months. Developing is generating enormous volumes of data every a user centered tool that can process this information minute. It becomes difficult to select credi- in real time has become need of the day for professional ble and relevant tweets that may form quality journalists. news among others. The problem intensifies From the Arab Spring to the Oscars 2014 Selfie due to the freedom of Twitter being an infor- tweets have changed the way the world shares infor- mal language. Generating headlines by solv- mation. Scholars today can predict election results ing this problem may still not be relevant and better than ever before [Ocon10]. The ”#” Hashtag may face the question of authenticity. Given a feature in Twitter has made event stories easier to cap- set of keywords and a time period this problem ture [Zan11]. As a result social network mining, orig- becomes manageable and can be solved effi- inally loaded with clustering and classification of on- ciently. We propose a bipartite algorithm that line worlds, is leveraging on understanding evolution clusters authentic tweets based on key phrases of real-world events [Dom05].Adding another feather and ranks the clusters based on trends in each to its cap is the fact that newspaper and magazines timeslot. Finally, we present an approach to have started publishing content on social media sites select those topics which have sufficient con- like Twitter and Facebook. To summarize, news no tent to form a story longer breaks it tweets (Solis)[Sol10]. The goal of this paper is to demonstrate the use 1 Introduction of Twitter to monitor headlines online and generate news stories. We propose a standalone system TwiB- Journalism is the state of art that disseminates infor- iNG to extract tweets related to user defined keywords mation and provides analysis of news to the general and propose ranked news summaries based on trend Copyright © by the paper’s authors. Copying permitted only and relevance of tweets they contain. The key novelty behind TwiBiNG is generation of Bi-partitite clusters for private and academic purposes. In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of tweet intentions and use of Longest common-sub- of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, sequence(LCS) algorithm along with a few tweet cre- published at http://ceur-ws.org ator’s details to separate relevant tweets from irrele- vant ones. This approach not only produces better characters) and informal text are some issues that pose clusters but also generates stories that are authentic, problems to many text mining researchers (Hong and contains less spam and more importantly are distinct Davison [Hon10]). Bollen et al. [Bol11] used terms ex- from each other. Also since we base our approach pressing positive and negative behavior for sentiment on intention of tweets it makes it language indepen- analysis on Twitter. dent. Readers should note that by intention we refer Text Clustering is another where scholars have to the general subject of tweet; not the intention of worked for content analysis. Goyal and Mehala the user posting it. The selected datasets were de- [Goy13] presented an approach to find conceptually re- veloped from tweets collected between Tue 25 Feb, lated queries by clustering on bipartite and tripartite 18:00 GMT and Wed 26 Feb, 18:00 GMT based on graphs. We try to propose a similar approach for Twit- keywords ”Syria”,”Ukraine”,”Terror”,”Bitcoin”. We ter content analysis using Bipartite graph. [Aie13] pro- collected 1,041,062 unique tweets from 556,295 users poses trend based tweet clustering approaches. We which included 648,651 retweets and 135,141 replies. present an approach that uses a modified BNgram The crawl also included messages sent from or to a set clustering approach, which has motivation from orig- of around 5000 journalists/commentators. inal approach of [Aie13]. Phuvipadawat and Murata In short our contributions can be summarized as: [Phu10] present a breaking news prediction algorithm that clusters tweets based on First Story detection af- ˆ We incorporated retweets in BNgrams clustering ter segmenting different stories. TwitterStand [San09] [Aie13] and hence improved upon the trend rank- develops a ”leader-follower” text clustering algorithm. ing of keywords. 2.2 Natural Language Processing ˆ We clustered our tweets based on bipartitite graph thereby clubbing similar intention tweets Headline Generation has been active area of research together. among NLP researchers. Most of the scholars work here by selecting a proper set of keywords and finding ˆ We reduced the effect of informal text in Twitter a way to combine them in a way that forms a gram- by using LCS based similarity score while dealing matically coherent and meaningful sentence. In Banko with keywords. et al.[Ban00] authors present a statistical approach to ˆ We presented news headlines by ranking clustered term selection and term ordering process that depicts tweets based on relevance to the clustered key- the power of non-extractive summarization whereas word set and use ‘Part Of Speech’ tagger to make Jin and Hauptman [Jin01] presents an approach for them readable. extractive summarization along with a Bayesian ap- proach. They also discuss various issues in keyword se- The remainder of the paper is organized as follows: In lection for headline generation. We use Part of speech Section 2 we take a look at existing algorithms and ap- tagging along with most relevant tweet identification proaches.Section 3 details about proposed methodolo- to generate meaningful user readable headline. gies and approaches. Section 4 provides a discussion of results. Section 5 concludes the work by laying a 3 Methodology foundation for future work. We divide our process in four phases 1) Data prepara- tion, 2) Data Clustering 3) Cluster Ranking, 4) Tweet 2 Related Work Ranking and Headline generation. We will now de- The work of generating headlines using social media scribe our TwiBiNG system phase by phase: can be seen as a combination of two branches 1) Infor- mation Retrieval and Text Mining and 2) Natural Lan- 3.1 Data Preparation guage Processing. Scholars have worked extensively on Once the data set for a given timeslot is ready by ex- Twitter data using both the fields. Here we present an tracting tweets related to a given set of seeds and key- overview of existing approaches in both fields: words, we tag entities in tweets using Stanford’s Part- of-Speech Tagger and extract nouns, HashTags, Users. 2.1 Text Mining on Twitter Content We ignore other parts of speech, thereby concentrating Twitter has its own conventions for language while more on the subject than the predicate. This is be- (@) is used to mention user, (#) is used to identify cause in a given timeslot, it is difficult for predicate to events and ”RT” is used to represent a retweet. Bifet change rapidly for the same subject while the reverse and Frank [Bif10] use these features for opinion min- may not be true. These tagged words are referred as ing. Zhao et al.[Zha11] develop a Twitter-LDA model key phrases (KP) from now on. We now decide on through content analysis. The restricted length (140 trending keywords. We rank keywords using a modified df-idft [Aie13] If any LCS(Si , Ui ) contains Ui then we include all the score by incorporating retweets: tweets related to Si in set < T Ui > which contains Ri −Ri−1 tweet ids related to user centered keywords. We scan R(ki ) = max(Ri ,Ri−1 ) the database for the timeslot again and remove those Score(ki ) = ti ∗ log(1 + R(ki +1) tweets which are not contained in < T Ui > (user- ti−1 +1 ) centric tweets). At the end of this stage we end up Here Ri represents number of retweets for keyword k with a set of tweets and related keywords that can be in timeslot i and ti represents number of tweets for considered authentic for a news story. keyword k. Since a keyword may be related to un- bounded number of tweets and retweets in a timeslot 3.2 Intention based Tweet Clustering deciding on threshold is difficult. Therefore, we de- We use the approach used in [Goy13] to use bipartite cided to normalize the score for each keyword using clustering of tweets. The basic aim here is to get real min-max normalization. Let < K > be the set of intention of tweets in clusters. Algorithm 1 presents tweets in a slot i then normalized score is given by: an incremental bipartite algorithm to cluster tweets N ormalizedScore(N Ki ) = and keywords. Once we have a set of clusters we know the intention of tweets. As can be seen the threshold Score(ki ) − min(Score(< K >)) is kept > 0.5, which signifies that keywords merged max(Score(< K >)) − min(Score(< K >)) should have an intention similarity of more than 50%. The threshold for these normalized keywords was de- Readers requiring more specific tweets to be clustered cided to be 0.0075 through experiments. We select the together may increase the similarity but this comes at keywords above this threshold and store them in a set a cost of duplicate tweets being merged together. As (Si ). We observed that for each timeslot at this thresh- can be observed in Algorithm 1, since the clustering old we get around 800-875 trending keywords. Once is on basis of basis of Intersection(Ti ,Tj ) there will be this set was ready we assigned tweets to each keyword, duplicate tweets in cluster but a news story contain- i.e. we reversed the bipartite graph of Figure 1. We ing a lot of duplicate tweets would be considered of now filter the tweets based on user details specifically poor quality. So removing duplicate content becomes number of followers and status counts. This step is a prime task now. necessary in order to increase authenticity and reduce Data: I< Si , < T Si >> Si and T Si denotes a set tweets containing spamming content. Since clustering of keywords and related tweets is based on tweet intention, not performing the previ- Result: O< CSi , < CT Si >> clustered set of ous step may hamper clustering performance. Also the tweets generated stories may not be considered quality news. Let S: represent set of unique keywords Our experiments based on (Hutto et. al. [Hut13]) de- while clusters exist with similarity > threshold do cided that users with a follower count>600 and tweet flag=0; count>6000 may be considered authentic and consid- while si in S do ering tweets by these users alone will significantly im- j=i+1; prove system performance. while tj in T do Sim(si ,sj ) =Intersection(T si ,T sj )/Union(T si ,T sj ); if Sim (si , sj ) > 0.5 then I< si , < T si >> = I< si = sj , < U nion(T si , T sj ) >> Remove sj from I flag=1; end end if flag=0 then b end Now since we are building a user centered news gen- reak; erator we want tweets related to the keywords defined end end by user to improve relevancy. For this purpose we scan all keywords in (Si ) and compute their Similarity with Algorithm 1: Bipartite Clustering of Tweets using user-defined keywords (Ui ). Keywords LCS(Si , Ui ) = LongestCommonSubsequence(Si , Ui ) In Algorithm 2 we present an algorithm to remove du- plicate tweets from cluster: has its Motivation from BNgram clustering approach Data: < CSi , < CT Si >> Set of tweets in a cluster used in [Aie13]. Readers can think of T CSi as a boost of keywords CSi factor for relevance. Result: : < CSi , < F T Si >> Final Set of tweets ClusterScore(CScri ) = RCSi ∗ T CSi and clusters while csi in CSi do We now rank the clusters based on (CScri ). At the while ti in CT Si do end of this phase we have ranked our clusters and to j=i+1 avoid any confusion further we now refer them as < if < Di >.contains< tj > = false then CSir , < F T Sir >>. while tj in CT Sj do sim(ti , tj )= 3.4 Tweet Ranking in Clusters LCS(ti , tj )/Min(ti .length,tj .length) if sim(ti , tj ) > 0.65 then Now once clusters are ranked we need to rank tweets < Di >.add(tj ); contained in them in order to present them in most rel- end evant order. Before introducing ranking calculations end we need to introduce expanded keyword set. This can end be seen as a prerequisite in the step of headline forma- end < F T Si > = < CT Si >-< Di >; tion. This step is necessary and relevant since some of < CSi , < CT Si >> =< CSi , < F T Si >> the clusters may contain a small number of keywords end and need sufficient information to generate a story. We represent the expanded cluster set as < ECSi > . Let Algorithm 2: To remove Duplicate Tweets from Clus- set < Kt > represent set of keywords for tweet Ti . ter Then relevance score for Ti is calculated as The motivation behind threshold of 0.65 in Algo- Intersection(< Kt >, < ECSi >) rithm 2 can be observed in O’Connor [Oco10]. We Score(T i) = U nion(< Kt >, < ECSi >) end this phase with a cluster of keywords and their relevant set of tweets. So now we know the intention Now we rank our tweets based on Score(Ti ). At the of our keywords and we are ready to rank them. end of this phase, we filter out tweets which have a score(Ti ) ¡ 0.3. The threshold 0.3 is based on the 3.3 Cluster Ranking results of our experiments, as described in Table 2. Up until this phase we have obtained required set of Increasing the threshold provides better quality sto- clusters. We now need to rank them. Although differ- ries but reduces the number of stories at a high rate. ent authors [Yaj12][Hav03][Shu11] have proposed effi- Hence, readers requiring more focused stories may in- cient topic ranking methods they have a common fea- crease the threshold. ture that relevance to considered keywords is consid- ered an important issue. We make use of this fact and 3.5 Cluster Selection and Headline Genera- of normalized trend score to generate a ranking score tion for clusters. Since we are vying for a user centric tool In this phase we provide an approach to decide which our clusters should be most relevant to their inten- clusters can form news. As can be observed not all tion. Also since we have to generate headlines trend clusters form a story, we must judiciously decide on needs a special attention. Keeping the above two facts clusters to form news. By experiments, we observed we present our cluster ranking methodology. Using the following Heuristic may be used to select quality < Ui > we collected tweets for relevant keywords in clusters: H3.5.1: Those clusters tend to form quality section 3.1 as set < T Ui >. We calculate Relevancy of stories which contain at least four keywords, one Hash- cluster CSi having tweets < F Si > as: tag keyword, and is related to at least three tweets .Further , number of non Hashtag keywords should be RCSi = Relevancy(CSi ) = M ax(Intersection(U U nion(Ui ,F Si ) i ,F Si ) more than Hashtag keywords. The rationale behind this approach can be ex- This relevancy score gives us an indication about the plained. The clusters having excessive amounts of relation of cluster to the user’s intention. hashtags as keywords are usually related to tweets with T CSi = T rend(CSi ) = e−M ax(N ormalizedScoreof CSi) almost similar content. Having a hashtag allows users to easily identify events and more than three distinct This factor indicates that how much a cluster is trend- tweets allows us to form a sequence of events. Since, ing. The idea of taking Max(Normalized Score of CSi ) we are needed to identify a fixed number of topics, we follow H3.5.1 and scan all the clusters in < Csir > up are covered, but only the most relevant are shown for until the specified number of clusters in each timeslot. clarity.These results show an improved performance Hence, we follow a dynamic approach that is indepen- over previously existing systems. A limitation of this dent of cluster count. system is not including user’s community which may For Headline Generation we order the keywords in have allowed us to form tripartite clustering, thereby accordance to top ranked tweet in cluster and use POS improving clustering quality at a low cost. Use of bet- tagger to connect the keywords. We believe that better ter known String matching algorithms may improve approaches to form headlines exist, but we were deal- cluster quality. Our use of bipartite clustering algo- ing with informal language so we need to take support rithm can allow future researchers to explore more into from tweet intent to form them. Readers may improve this field. upon this aspect by considering statistical techniques mentioned in section 2.2. 5 Acknowledgement Authors owe a debt of gratitude to Dr. P. Goyal and 4 Results and discussion Dr. N. Mehala for their constructive criticism and innovative ideas that formed the foundation of this Table 1 depicts human evaluation of results as car- study. We would like to extend special thanks Birla ried out by authors. The official evaluation results Institute of Technology and Science for providing re- of our method in the Data Challenge are included sources without which this work would never have been completed. We would like to thank SNOW’14 orga- in snow2014dc [Pap14]. The language content shows nizers for giving us a chance to work on social sensor that our topics were evenly distributed between En- project and for their immediate follow up in cases of glish and non-English tweets. This is probably due difficulty. to selection of keywords related to Syria and Ukraine, which allowed foreign phrases to come in the dataset. References News Headline Readability being a highly subjective [Ocon10] O’Connor, B., Balasubramanyan, R., Routledge, B. attribute, needs to be evaluated manually. A News R., & Smith, N. A. (2010), From tweets to polls: Headline is considered readable if majority of the users Linking text sentiment to public opinion time series, accessing the system can comprehend it without the ICWSM,11,122-129 . use of other resources. Further, it can be observed [Zan11] Zangerle, E., Gassler, W., & Specht, G. (2011). that 81.60% of our topics were labeled readable by Recommending#-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb language experts. The images related to the extracted 2011). CEUR Workshop Proceedings (Vol. 730, pp. 67-78). tweets were found to symbolize the news story with [Dom05] Domingos, P. (2005). Mining social networks for viral 97.67% accuracy. marketing. IEEE Intelligent Systems, 20(1), 80-82. Table 2 represents the number of topical clusters [Sol10] Solis, B. (2010). The information di- with increasing score(Ti) threshold. As can be ob- vide between traditional and new media, served, number of clusters decrease at a high rate with http://www.briansolis.com/2010/02/the-information- respect to the threshold value. Thereby, allowing us divide-the-socialization-of-news-and-dissemination/, Internet Draft (last accessed March 16, 2014) to select 0.3 as our base threshold. [Aie13] Aiello, L., Petkos, G., Martin, C., Corney, D., Pa- padopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., Table 1: Human Evaluation of topics Jaimes, A. (2013) Sensing trending topics in Twitter. Mul- timedia, IEEE Transactions on 15(6) 2681282. English 282 Language [Bif10] Bifet, A., & Frank, E. (2010).Sentiment knowledge dis- Non-English 256 covery in Twitter streaming data. In Discovery Science (pp. Good 439 1-15). Springer Berlin Heidelberg. News Headline Readability Bad 99 [Zha11] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Related 84 Yan, H., & Li, X. (2011).Comparing Twitter and tradi- Topics with images Unrelated 2 tional media using topic models. In Advances in Informa- tion Retrieval (pp. 338-349). Springer Berlin Heidelberg. [Hon10] Hong, L., & Davison, B. D. (2010).Empirical study Table 2: Number of clusters v/s Score(Ti) Threshold of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics (pp. 80-88). ACM. Threshold 0.25 0.30 0.35 0.40 No. of [Bol11] Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public 754 538 467 261 mood and emotion: Twitter sentiment and socio-economic Clusters phenomena. In ICWSM. [Goy13] Goyal, P., Mehala, N., & Bansal, A. (2013). A robust Table 3 represents sample topics along with Headline, approach for finding conceptually related queries using fea- timestamp, related tweets and set of keywords. The ture selection and tripartite graph structure. Journal of In- readers should note that not all the tweets in the story formation Science, 39(5), 575-592. [Phu10] Phuvipadawat, S., & Murata, T. (2010). Breaking [Hav03] Haveliwala, T. H. (2003). Topic-sensitive pagerank: A news detection and tracking in Twitter. In Web Intelli- context-sensitive ranking algorithm for web search. Knowl- gence and Intelligent Agent Technology (WI-IAT), 2010 edge and Data Engineering, IEEE Transactions on, 15(4), IEEE/WIC/ACM International Conference on (Vol. 3, pp. 784-796. 120-123). IEEE. [Shu11] Shubhankar, K., Singh, A. P., & Pudi, V. (2011). An [San09] Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieber- efficient algorithm for topic ranking and modeling topic man, M. D., & Sperling, J. (2009). Twitterstand: news in evolution. In Database and Expert Systems Applications tweets. In Proceedings of the 17th ACM SIGSPATIAL In- (pp. 320-330). Springer Berlin Heidelberg. ternational Conference on Advances in Geographic Infor- mation Systems (pp. 42-51). ACM. [Pap14] Papadopoulos S., Corney D., Aiello L. (2014). SNOW 2014 Data Challenge: Assessing the Performance of News [Ban00] Banko, M., Mittal, V. O., & Witbrock, M. J. (2000). Topic Detection Methods in Social Media. In Proceedings Headline generation based on statistical translation. In of the SNOW 2014 Data Challenge. Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 318-325). Association [Hut13] Hutto, C. J., Yardi, S., & Gilbert, E. (2013). A longitu- for Computational Linguistics. dinal study of follow predictors on twitter. In Proceedings [Jin01] Jin, R., & Hauptmann, A. G. (2001). Generation Us- of the SIGCHI Conference on Human Factors in Comput- ing a Training Corpus. In Computational Linguistics and ing Systems (pp. 821-830). ACM. Intelligent Text Processing (pp. 208-215). Springer Berlin [Oco10] O’Connor, B., Krieger, M., & Ahn, D. (2010). Tweet- Heidelberg. Motif: Exploratory Search and Topic Summarization for [Yaj12] YaJuan, D. U. A. N., WEIF uRu, C. Z., Heung, Z. M., Twitter. In Proceedings of the 4th Int’l AAAI Conference & Shum Y. (2012). Twitter topic summarization by rank- on Weblogs and Social Media. ing tweets using social influence and content quality. In Proceedings of the 24th International Conference on Com- putational Linguistics (pp. 763-780). Table 3: Sample Headlines and Stories HEADLINE KEYWORDS TWEETS 1) #Syria #Homs #Aleppo Leader of Syrian Syria alQaeda Rivals, militant group leader gives alQaeda, challenges rivals rivals ultimatum. #Syria, 2) RT: Top group, al-qaeda leader (25-02-14 19:00) ultimatum abu khalid al- Suri was reportedly killed by a rival.#Syria #ukraine Rada says try Yanukovich before Int Crime Court. Should be Ukraine tried by parliament Yanukovich, Ukrainians for wants Ukraine, crimes against Yanukovich parliament, Ukrainians! tried international court 2) Yanukovich court papers:Snipers who (25-02-14 18:45) killed dozens of protesters came from Ukraine’s ”omega” special forces.#euromaiden Russian President 1) Putin orders Vladimir Putin troops to prepare ordered test in case of ’a crisis’ combat readiness Vladimir, in Ukraine as for troops Putin, tensions step up. stationed region Yanukovych, Report on The that touches Russia, 530 now @tv3News Ukraines Ukraine 2) Russia puts northern troops on border alert amid (26-02-14 17:30) Ukraine tension. Ukraine leaders Not in my riot, disband riot wildest dreams I’d Ukraine, police who imagine Arab police police, kneel down doing so unit, ask forgive- #Ukraine riot crackdown, ness from police asking Kiev, the people forgiveness protesters (26-02-14 17:45) from protesters The equivalent of Bitcoin turmoil war when states are rumoured 375m theft, in danger. theft closes Gox, major Bitcoin, Bitcoin exchange. exchange exchange fears (26-02-14 03:30) $400m theft #bitcoin bitcoin major # Business?Exchange turning point time, closes as virtual mtgox exchange website, money vanishes: abruptly stops transactions, ANGRY Bitcoin trading 774000 being, enthusiasts are bitcoin reported Bitcoin protesting the missing collapse (26-02-14 03:45)