TwiBiNG: A Bipartite News Generator Using Twitter

Yashvardhan Sharma, Department of Computer Science, Birla Institute of Technology & Science, Pilani 333 031, India

Divyansh Bhatia, Department of Computer Science, Birla Institute of Technology & Science, Pilani 333 031, India

Vivek Kishore Choudhary, Department of Computer Science, Birla Institute of Technology & Science, Pilani 333 031, India


Online journalism is seen as the future of journalism. News professionals are vying to capture newsworthy stories that emerge from the crowd. Live social media, especially Twitter, generates enormous volumes of data every minute, which makes it difficult to select the credible and relevant tweets that may form quality news. The problem is intensified by the informal language that Twitter permits. Headlines generated without further constraints may still lack relevance and face questions of authenticity. Given a set of keywords and a time period, the problem becomes manageable and can be solved efficiently. We propose a bipartite algorithm that clusters authentic tweets based on key phrases and ranks the clusters based on trends in each timeslot. Finally, we present an approach to select those topics which have sufficient content to form a story.

Introduction

In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, published at http://ceur-ws.org

With the advent of Web 2.0, much of journalism has moved online, giving rise to the term "Online Journalism". Since web users readily share every activity in their lives, professionals have become hungry for content. Twitter generates an amount of information that can outrun the storage space of many servers in a few months. Developing a user-centered tool that can process this information in real time has become the need of the day for professional journalists.

From the Arab Spring to the Oscars 2014 selfie, tweets have changed the way the world shares information. Scholars today can predict election results better than ever before [Ocon10]. The "#" hashtag feature in Twitter has made event stories easier to capture [Zan11]. As a result, social network mining, originally concerned with clustering and classification of online worlds, is now being leveraged to understand the evolution of real-world events [Dom05]. Adding another feather to its cap, newspapers and magazines have started publishing content on social media sites like Twitter and Facebook. To summarize, news no longer breaks, it tweets (Solis) [Sol10].

The goal of this paper is to demonstrate the use of Twitter to monitor headlines online and generate news stories. We propose a standalone system, TwiBiNG, that extracts tweets related to user-defined keywords and proposes ranked news summaries based on the trend and relevance of the tweets they contain. The key novelty behind TwiBiNG is the generation of bipartite clusters of tweet intentions and the use of the Longest Common Subsequence (LCS) algorithm, along with a few details about the tweet's creator, to separate relevant tweets from irrelevant ones. This approach not only produces better clusters but also generates stories that are authentic, contain less spam and, more importantly, are distinct from each other. Since we base our approach on the intention of tweets, it is also language independent. Readers should note that by intention we refer to the general subject of a tweet, not the intention of the user posting it. The selected datasets were developed from tweets collected between Tue 25 Feb, 18:00 GMT and Wed 26 Feb, 18:00 GMT based on the keywords "Syria", "Ukraine", "Terror" and "Bitcoin". We collected 1,041,062 unique tweets from 556,295 users, which included 648,651 retweets and 135,141 replies. The crawl also included messages sent from or to a set of around 5000 journalists/commentators.

In short our contributions can be summarized as:

- We incorporated retweets in BNgram clustering [Aie13] and hence improved upon the trend ranking of keywords.

- We clustered our tweets based on a bipartite graph, thereby grouping tweets with similar intention together.

- We reduced the effect of informal text in Twitter by using an LCS-based similarity score while dealing with keywords.

- We presented news headlines by ranking clustered tweets based on relevance to the clustered keyword set and used a part-of-speech tagger to make them readable.

The remainder of the paper is organized as follows. Section 2 reviews existing algorithms and approaches. Section 3 details the proposed methodology. Section 4 provides a discussion of results. Section 5 concludes the work and lays a foundation for future work.

Related Work

The work of generating headlines using social media can be seen as a combination of two branches: 1) Information Retrieval and Text Mining and 2) Natural Language Processing. Scholars have worked extensively on Twitter data in both fields. Here we present an overview of existing approaches in each:

Text Mining on Twitter Content

Twitter has its own language conventions: (@) is used to mention a user, (#) is used to identify events, and "RT" is used to represent a retweet. Bifet and Frank [Bif10] use these features for opinion mining. Zhao et al. [Zha11] develop a Twitter-LDA model through content analysis. The restricted length (140 characters) and informal text pose problems for many text mining researchers (Hong and Davison [Hon10]). Bollen et al. [Bol11] used terms expressing positive and negative behavior for sentiment analysis on Twitter. Text clustering is another area where scholars have worked on content analysis.

Goyal and Mehala [Goy13] presented an approach to find conceptually related queries by clustering on bipartite and tripartite graphs. We propose a similar approach for Twitter content analysis using a bipartite graph. Aiello et al. [Aie13] propose trend-based tweet clustering approaches. We present a modified BNgram clustering approach that is motivated by the original approach of [Aie13]. Phuvipadawat and Murata [Phu10] present a breaking news prediction algorithm that clusters tweets based on first story detection after segmenting different stories. TwitterStand [San09] develops a "leader-follower" text clustering algorithm.

Natural Language Processing

Headline generation has been an active area of research among NLP researchers. Most work selects a proper set of keywords and finds a way to combine them into a grammatically coherent and meaningful sentence. Banko et al. [Ban00] present a statistical approach to term selection and term ordering that shows the power of non-extractive summarization, whereas Jin and Hauptmann [Jin01] present an approach for extractive summarization along with a Bayesian approach. They also discuss various issues in keyword selection for headline generation. We use part-of-speech tagging along with identification of the most relevant tweet to generate a meaningful, readable headline.

Methodology

We divide our process into four phases: 1) Data Preparation, 2) Data Clustering, 3) Cluster Ranking, and 4) Tweet Ranking and Headline Generation. We now describe our TwiBiNG system phase by phase:

Data Preparation

Once the dataset for a given timeslot has been built by extracting tweets related to a given set of seeds and keywords, we tag entities in tweets using Stanford's Part-of-Speech Tagger and extract nouns, hashtags and users. We ignore other parts of speech, thereby concentrating more on the subject than the predicate. This is because, in a given timeslot, it is difficult for the predicate to change rapidly for the same subject, while the reverse may not be true. These tagged words are referred to as key phrases (KP) from now on. We now decide on trending keywords.
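As a rough illustration of the key-phrase step, the sketch below pulls only hashtags and user mentions out of a tweet with regular expressions. It does not cover noun extraction, which in the paper relies on Stanford's tagger; the function name and the lower-casing are assumptions made for this example.

```python
import re

# Hypothetical helper: covers hashtags and mentions only.
# Noun extraction with a POS tagger is intentionally not shown.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_tags_and_users(tweet_text):
    """Return (hashtags, mentions) in order of appearance, lower-cased."""
    hashtags = [h.lower() for h in HASHTAG_RE.findall(tweet_text)]
    mentions = [m.lower() for m in MENTION_RE.findall(tweet_text)]
    return hashtags, mentions


tags, users = extract_tags_and_users(
    "RT @BBCWorld: Protests continue in #Kiev #Ukraine"
)
```

Lower-casing is one simple way to treat "#Ukraine" and "#ukraine" as the same key phrase; whether the original system does this is not stated.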

We rank keywords using a modified df-idft [Aie13] score by incorporating retweets:

$$R(k_i) = \frac{R_i - R_{i-1}}{\max(R_i, R_{i-1})}$$

$$\mathrm{Score}(k_i) = t_i \cdot \log\left(1 + \frac{R(k_i) + 1}{t_{i-1} + 1}\right)$$

Here $R_i$ represents the number of retweets for keyword $k$ in timeslot $i$ and $t_i$ represents the number of tweets for keyword $k$ in that timeslot. Since a keyword may be related to an unbounded number of tweets and retweets in a timeslot, deciding on a threshold is difficult. Therefore, we normalize the score of each keyword using min-max normalization. Let $\langle K \rangle$ be the set of keywords in slot $i$; then the normalized score is given by:

$$\mathrm{NormalizedScore}(NK_i) = \frac{\mathrm{Score}(k_i) - \min(\mathrm{Score}(\langle K \rangle))}{\max(\mathrm{Score}(\langle K \rangle)) - \min(\mathrm{Score}(\langle K \rangle))}$$
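The scoring and normalization steps can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the trend score reads as Score(k_i) = t_i * log(1 + (R(k_i) + 1) / (t_{i-1} + 1)), treats R(k_i) as 0 when a keyword has no retweets in either slot, and uses made-up counts. The 0.0075 cut-off is the value the paper reports for normalized scores.

```python
import math


def retweet_change(r_now, r_prev):
    """R(k_i): relative change in retweets between consecutive timeslots.

    Assumption: defined as 0.0 when both counts are zero.
    """
    denom = max(r_now, r_prev)
    return 0.0 if denom == 0 else (r_now - r_prev) / denom


def trend_score(t_now, t_prev, r_now, r_prev):
    """Score(k_i) under the reading of the formula stated in the lead-in."""
    return t_now * math.log(1 + (retweet_change(r_now, r_prev) + 1) / (t_prev + 1))


def min_max_normalize(scores):
    """Min-max normalize a dict of keyword -> score into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: nothing to separate
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


# Illustrative counts: (tweets now, tweets before, retweets now, retweets before)
counts = {
    "ukraine": (120, 40, 80, 20),
    "bitcoin": (30, 30, 10, 10),
    "weather": (5, 50, 1, 10),
}
scores = {k: trend_score(*v) for k, v in counts.items()}
normalized = min_max_normalize(scores)
trending = {k for k, v in normalized.items() if v > 0.0075}
```

With min-max normalization the lowest-scoring keyword always maps to 0 and is therefore never selected, which is one reason a small positive threshold is enough to prune the tail.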

The threshold for these normalized scores was set to 0.0075 through experiments. We select the keywords above this threshold and store them in a set ($S_i$). We observed that at this threshold we get around 800-875 trending keywords for each timeslot. Once this set was ready we assigned tweets to each keyword, i.e. we reversed the bipartite graph of Figure 1. We now filter the tweets based on user details, specifically the number of followers and the status count. This step is necessary to increase authenticity and reduce the number of tweets containing spam. Since clustering is based on tweet intention, skipping this step may hamper clustering performance, and the generated stories may not be considered quality news. Based on our experiments, guided by Hutto et al. [Hut13], we decided that users with a follower count > 600 and a tweet count > 6000 may be considered authentic, and that considering tweets by these users alone significantly improves system performance. Since we are building a user-centered news generator, we want tweets related to the keywords defined by the user in order to improve relevancy. For this purpose we scan all keywords in ($S_i$) and compute their similarity with the user-defined keywords ($U_i$).

$$\mathrm{LCS}(S_i, U_i) = \mathrm{LongestCommonSubsequence}(S_i, U_i)$$

If any $\mathrm{LCS}(S_i, U_i)$ contains $U_i$, then we include all the tweets related to $S_i$ in the set $\langle TU_i \rangle$, which contains the tweet ids related to user-centered keywords. We scan the database for the timeslot again and remove those tweets which are not contained in $\langle TU_i \rangle$ (the user-centric tweets). At the end of this stage we have a set of tweets and related keywords that can be considered authentic for a news story.
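The two filters of this phase can be sketched as below. The follower and status thresholds are the ones reported in the text. The record layout (`followers_count`, `statuses_count`), the helper names and the case folding are assumptions for illustration. "LCS contains the user keyword" is read here as "the user keyword is a subsequence of the trending keyword", which holds exactly when the LCS length equals the user keyword's length.

```python
MIN_FOLLOWERS = 600   # threshold reported in the text
MIN_STATUSES = 6000   # threshold reported in the text


def is_authentic(user):
    """Assumed record layout: a dict with follower and status counts."""
    return (user["followers_count"] > MIN_FOLLOWERS
            and user["statuses_count"] > MIN_STATUSES)


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def matches_user_keyword(trending_keyword, user_keyword):
    """True when the whole user keyword survives in the LCS."""
    s, u = trending_keyword.lower(), user_keyword.lower()
    return lcs_length(s, u) == len(u)


def user_centric_tweet_ids(keyword_tweets, user_keywords):
    """Union of tweet ids attached to keywords that match any user keyword."""
    selected = set()
    for keyword, tweet_ids in keyword_tweets.items():
        if any(matches_user_keyword(keyword, u) for u in user_keywords):
            selected |= set(tweet_ids)
    return selected


keyword_tweets = {"#ukraine": {1, 2}, "ukraiine": {3}, "bitcoin": {4}}
selected = user_centric_tweet_ids(keyword_tweets, ["Ukraine"])
```

Subsequence matching tolerates decorations and stretched spellings ("#ukraine", "ukraiine"), which is the point of using LCS on informal text. The flip side is that very short user keywords match many unrelated keywords.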

Intention based Tweet Clustering

We follow the approach of [Goy13] and use bipartite clustering of tweets. The basic aim here is to capture the real intention of tweets in clusters. Algorithm 1 presents an incremental bipartite algorithm to cluster tweets and keywords. Once we have a set of clusters we know the intention of the tweets. As can be seen, the threshold is kept > 0.5, which signifies that merged keywords should have an intention similarity of more than 50%. Readers requiring more specific tweets to be clustered together may increase the similarity threshold, but this comes at the cost of duplicate tweets being merged together. As can be observed in Algorithm 1, since the clustering is on the basis of Intersection($T_i$, $T_j$), there will be duplicate tweets in a cluster, and a news story containing many duplicate tweets would be considered of poor quality. Removing duplicate content therefore becomes a prime task.
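The merging idea behind Algorithm 1 can be sketched as follows: keywords whose tweet sets overlap by more than the threshold (Jaccard similarity > 0.5) are folded into one cluster, and the scan restarts until no pair qualifies. The data layout and function names are assumptions. The restart-after-merge loop is one simple way to realize the "repeat while a merge happened" structure, not necessarily the authors' exact control flow.

```python
def jaccard(a, b):
    """|a & b| / |a | b| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_keywords(keyword_tweets, threshold=0.5):
    """Merge keywords whose tweet sets have Jaccard similarity > threshold.

    keyword_tweets: dict mapping keyword -> set of tweet ids.
    Returns a list of (keyword_set, tweet_id_set) pairs.
    """
    clusters = [({kw}, set(tweets)) for kw, tweets in keyword_tweets.items()]
    merged = True
    while merged:                      # repeat until a full pass finds no merge
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i][1], clusters[j][1]) > threshold:
                    kws_j, tweets_j = clusters.pop(j)
                    clusters[i][0].update(kws_j)     # merge keyword sets
                    clusters[i][1].update(tweets_j)  # union of tweet sets
                    merged = True
                    break
            if merged:
                break
    return clusters


keyword_tweets = {
    "kiev": {1, 2, 3},
    "#ukraine": {1, 2, 3, 4},
    "protest": {2, 3, 4},
    "bitcoin": {7, 8},
    "mtgox": {7, 8, 9},
}
clusters = cluster_keywords(keyword_tweets)
```

In this toy input the three Ukraine-related keywords collapse into one cluster and the two Bitcoin-related keywords into another, because each merged pair shares more than half of its combined tweets.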

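In Algorithm 2 we present an algorithm to remove duplicate tweets from a cluster:

Algorithm 2: To remove Duplicate Tweets from Cluster

Data: <CS_i, <CTS_i>>: set of tweets in a cluster of keywords CS_i
Result: <CS_i, <FTS_i>>: final set of tweets and clusters
while cs_i in CS_i do
    while t_i in CTS_i do
        j = i + 1
        if <D_i>.contains(t_j) = false then
            while t_j in CTS_j do
                sim(t_i, t_j) = LCS(t_i, t_j) / Min(t_i.length, t_j.length)
                if sim(t_i, t_j) > 0.65 then
                    <D_i>.add(t_j);
                end
            end
        end
    end
    <FTS_i> = <CTS_i> - <D_i>;
    <CS_i, <CTS_i>> = <CS_i, <FTS_i>>
end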
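A compact sketch of the duplicate-removal idea: a tweet is dropped when its LCS similarity to an already kept tweet exceeds 0.65. Two choices here are assumptions, not taken from the paper: tweets are compared as sequences of lower-cased words (the text does not say whether lengths are counted in characters or words), and the first occurrence of a near-duplicate group is the one kept.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(tweet_a, tweet_b):
    """LCS length divided by the length of the shorter tweet (in words)."""
    a, b = tweet_a.lower().split(), tweet_b.lower().split()
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


def remove_duplicates(tweets, threshold=0.65):
    """Keep a tweet only if it is not too similar to any tweet kept so far."""
    kept = []
    for text in tweets:
        if all(similarity(text, k) <= threshold for k in kept):
            kept.append(text)
    return kept


tweets = [
    "Ukraine parliament votes to oust president",
    "RT Ukraine parliament votes to oust president",
    "Ukraine parliament votes to remove president Yanukovich",
    "Bitcoin exchange goes offline",
]
unique = remove_duplicates(tweets)
```

Dividing by the shorter length means a retweet that merely prefixes "RT" scores 1.0 against the original, so it is removed regardless of how much extra text is added around it.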

The motivation behind the threshold of 0.65 in Algorithm 2 can be found in O'Connor et al. [Oco10]. We end this phase with clusters of keywords and their relevant sets of tweets. We now know the intention of our keywords and are ready to rank them.

Cluster Ranking

Up to this phase we have obtained the required set of clusters, and we now need to rank them. Although different authors [Yaj12][Hav03][Shu11] have proposed efficient topic ranking methods, these share the feature that relevance to the considered keywords is treated as important. We make use of this fact and of the normalized trend score to generate a ranking score for clusters. Since we are aiming for a user-centric tool, our clusters should be as relevant as possible to the user's intention. Also, since we have to generate headlines, trend needs special attention. Keeping these two facts in mind, we present our cluster ranking methodology. Using $\langle U_i \rangle$ we collected tweets for relevant keywords in Section 3.1 as the set $\langle TU_i \rangle$. We calculate the relevancy of cluster $CS_i$ having tweets $\langle FS_i \rangle$ as:

$$R_{CS_i} = \mathrm{Relevancy}(CS_i) = \max\left(\frac{\mathrm{Intersection}(U_i, FS_i)}{\mathrm{Union}(U_i, FS_i)}\right)$$

This relevancy score gives us an indication of the relation of the cluster to the user's intention.

$$T_{CS_i} = \mathrm{Trend}(CS_i) = e^{-\max(\mathrm{NormalizedScore\ of\ } CS_i)}$$

This factor indicates how much a cluster is trending. The idea of taking Max(Normalized Score of $CS_i$) has its motivation in the BNgram clustering approach used in [Aie13]. Readers can think of $T_{CS_i}$ as a boost factor for relevance.

$$\mathrm{ClusterScore}(CScr_i) = R_{CS_i} \cdot T_{CS_i}$$

We now rank the clusters based on $CScr_i$. At the end of this phase we have ranked our clusters, and to avoid confusion we refer to them from now on as $\langle CS_{ir}, \langle FTS_{ir} \rangle \rangle$.
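The cluster score can be sketched as the product of a Jaccard relevancy and the exponential trend factor. This is an illustration under assumptions. Relevancy is computed once between the user-centric tweet set and the cluster's tweet set, so the outer Max of the printed formula is omitted because its range is not specified. The trend factor is taken as exp(-max normalized keyword score), as printed. All input values are made up.

```python
import math


def jaccard(a, b):
    """|a & b| / |a | b| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_score(cluster_tweets, user_tweets, keyword_scores):
    """Relevancy times trend factor for one cluster.

    keyword_scores: normalized trend scores of the cluster's keywords.
    """
    relevancy = jaccard(user_tweets, cluster_tweets)
    trend = math.exp(-max(keyword_scores))  # as printed in the formula
    return relevancy * trend


def rank_clusters(clusters, user_tweets):
    """clusters: dict name -> (tweet id set, list of normalized scores)."""
    scored = {
        name: cluster_score(tweets, user_tweets, scores)
        for name, (tweets, scores) in clusters.items()
    }
    return sorted(scored, key=scored.get, reverse=True), scored


user_tweets = {1, 2, 3, 4, 5}
clusters = {
    "A": ({1, 2, 3, 4}, [0.2, 0.3]),
    "B": ({3, 4, 5, 6, 7, 8}, [0.1]),
}
ranking, scored = rank_clusters(clusters, user_tweets)
```

Here cluster A shares four of five tweets with the user-centric set, while B shares three of eight, so A ranks first even though its trend factor is smaller.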

Tweet Ranking in Clusters

Once the clusters are ranked, we need to rank the tweets they contain in order to present them in the most relevant order. Before introducing the ranking calculations we need to introduce the expanded keyword set, which can be seen as a prerequisite for headline formation. This step is necessary because some clusters may contain only a small number of keywords and need more information to generate a story. We represent the expanded cluster set as $\langle ECS_i \rangle$. Let $\langle K_t \rangle$ represent the set of keywords for tweet $T_i$. Then the relevance score for $T_i$ is calculated as

$$\mathrm{Score}(T_i) = \frac{\mathrm{Intersection}(\langle K_t \rangle, \langle ECS_i \rangle)}{\mathrm{Union}(\langle K_t \rangle, \langle ECS_i \rangle)}$$

We now rank our tweets based on Score($T_i$). At the end of this phase, we filter out tweets which have Score($T_i$) < 0.3. The threshold 0.3 is based on the results of our experiments, as described in Table 2. Increasing the threshold provides better quality stories but rapidly reduces the number of stories. Hence, readers requiring more focused stories may increase the threshold.
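The tweet ranking step amounts to a Jaccard score between each tweet's keywords and the expanded cluster keyword set, followed by the 0.3 cut-off. A minimal sketch, with hypothetical tweet ids and keyword sets:

```python
def jaccard(a, b):
    """|a & b| / |a | b| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_tweets(tweet_keywords, expanded_keywords, threshold=0.3):
    """Score each tweet against the expanded keyword set, drop scores
    below the threshold, and return (tweet_id, score) best first."""
    scored = [
        (tid, jaccard(kws, expanded_keywords))
        for tid, kws in tweet_keywords.items()
    ]
    kept = [(tid, s) for tid, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


expanded = {"ukraine", "kiev", "protest", "#euromaidan"}
tweet_keywords = {
    "t1": {"ukraine", "kiev", "protest"},
    "t2": {"ukraine", "#euromaidan", "police", "russia"},
    "t3": {"ukraine", "weather", "football", "cup"},
}
ranked = rank_tweets(tweet_keywords, expanded)
```

Tweet t3 shares only one keyword with the cluster (score 1/7) and is filtered out, while t1 (3/4) and t2 (2/6) are kept in that order.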

Cluster Selection and Headline Generation

In this phase we provide an approach to decide which clusters can form news. Since not all clusters form a story, we must judiciously decide which clusters to use. Through experiments, we observed that the following heuristic may be used to select quality clusters:

H3.5.1: Clusters tend to form quality stories when they contain at least four keywords and at least one hashtag keyword, and are related to at least three tweets. Further, the number of non-hashtag keywords should be greater than the number of hashtag keywords.

The rationale behind this heuristic is as follows. Clusters with an excessive number of hashtags as keywords are usually related to tweets with almost identical content. Having a hashtag allows users to easily identify events, and having at least three distinct tweets allows us to form a sequence of events. Since we are required to identify a fixed number of topics, we follow H3.5.1 and scan the clusters in $\langle CS_{ir} \rangle$ until the specified number of clusters is reached in each timeslot. Hence, we follow a dynamic approach that is independent of the cluster count.
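Heuristic H3.5.1 translates directly into a predicate over a cluster's keywords and tweets, applied while scanning the ranked clusters. The sketch below assumes hashtag keywords are recognizable by a leading "#" and that clusters arrive as (keywords, tweet ids) pairs in rank order; both are assumptions for illustration.

```python
def is_story_candidate(keywords, tweet_ids):
    """H3.5.1: at least 4 keywords, at least 1 hashtag, at least 3 tweets,
    and more non-hashtag keywords than hashtag keywords."""
    hashtags = [k for k in keywords if k.startswith("#")]
    plain = [k for k in keywords if not k.startswith("#")]
    return (
        len(keywords) >= 4
        and len(hashtags) >= 1
        and len(tweet_ids) >= 3
        and len(plain) > len(hashtags)
    )


def select_topics(ranked_clusters, max_topics):
    """Scan clusters in rank order and keep the first max_topics that pass."""
    selected = []
    for keywords, tweet_ids in ranked_clusters:
        if is_story_candidate(keywords, tweet_ids):
            selected.append((keywords, tweet_ids))
            if len(selected) == max_topics:
                break
    return selected


ranked_clusters = [
    ({"#ukraine", "kiev", "protest", "parliament"}, {1, 2, 3}),
    ({"#btc", "#bitcoin", "#mtgox", "exchange"}, {4, 5, 6, 7}),  # too many hashtags
    ({"#syria", "aleppo", "aid", "convoy", "un"}, {8, 9}),       # too few tweets
    ({"#syria", "damascus", "talks", "geneva"}, {10, 11, 12}),
]
topics = select_topics(ranked_clusters, max_topics=2)
```

Because the scan stops as soon as enough clusters pass, the number of topics returned does not depend on how many clusters the earlier phases produced.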

For headline generation we order the keywords according to the top-ranked tweet in the cluster and use a POS tagger to connect the keywords. We believe that better approaches to forming headlines exist, but since we are dealing with informal language we rely on tweet intent to form them. Readers may improve upon this aspect by considering the statistical techniques mentioned in Section 2.2.

Results and discussion

Table 1 depicts the human evaluation of results as carried out by the authors. The official evaluation results of our method in the Data Challenge are included in snow2014dc [Pap14]. The language content shows that our topics were almost evenly distributed between English and non-English tweets. This is probably due to the selection of keywords related to Syria and Ukraine, which allowed foreign phrases to enter the dataset. News headline readability, being a highly subjective attribute, needs to be evaluated manually. A news headline is considered readable if the majority of users accessing the system can comprehend it without the use of other resources. It can be observed that 81.60% of our topics were labeled readable by language experts. The images related to the extracted tweets were found to symbolize the news story with 97.67% accuracy.
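The two percentages follow from the counts reported in Table 1 (439 readable and 99 unreadable headlines; 84 related and 2 unrelated images). As a quick check:

```latex
\[
\frac{439}{439 + 99} = \frac{439}{538} \approx 81.60\%,
\qquad
\frac{84}{84 + 2} = \frac{84}{86} \approx 97.67\%
\]
```

The same table gives the language split as 256 English and 282 non-English topics, i.e. 538 topics in total, which matches the denominator of the readability figure.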

Table 2 shows the number of topical clusters for increasing values of the Score($T_i$) threshold. As can be observed, the number of clusters decreases rapidly as the threshold increases, which led us to select 0.3 as our base threshold.

Algorithm 1: Bipartite Clustering of Tweets using Keywords

Data: I<S_i, <TS_i>>: S_i and TS_i denote a set of keywords and related tweets
Result: O<CS_i, <CTS_i>>: clustered set of tweets
Let S represent the set of unique keywords
while clusters exist with similarity > threshold do
    flag = 0;
    while s_i in S do
        j = i + 1;
        while t_j in T do
            Sim(s_i, s_j) = Intersection(Ts_i, Ts_j) / Union(Ts_i, Ts_j);
            if Sim(s_i, s_j) > 0.5 then
                I<s_i, <Ts_i>> = I<s_i = s_j, <Union(Ts_i, Ts_j)>>
                Remove s_j from I
                flag = 1;
            end
        end
    end
    if flag = 0 then
        break;
    end
end

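Table 1: Human Evaluation of topics

Criterion | Category | Topics
Language | English | 256
Language | Non-English | 282
News Headline Readability | Good | 439
News Headline Readability | Bad | 99
Topics with images | Related | 84
Topics with images | Unrelated | 2

Table 2: Number of clusters v/s Score(Ti) Threshold

Threshold | No. of Clusters
0.25 | 754
0.30 | 538
0.35 | 467
0.40 | 261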

Table 3 presents sample topics along with the headline, timestamp, related tweets and set of keywords. Readers should note that not all the tweets in a story are shown; only the most relevant ones are included for clarity.

These results show an improved performance over previously existing systems. A limitation of this system is that it does not include the user's community, which might have allowed us to form tripartite clusters and thereby improve clustering quality at a low cost. The use of better-known string matching algorithms may also improve cluster quality. Our use of a bipartite clustering algorithm can allow future researchers to explore this field further.

Acknowledgement

The authors owe a debt of gratitude to Dr. P. Goyal and Dr. N. Mehala for their constructive criticism and innovative ideas, which formed the foundation of this study. We would like to extend special thanks to Birla Institute of Technology and Science for providing resources without which this work would never have been completed. We would also like to thank the SNOW'14 organizers for giving us a chance to work on the social sensor project and for their immediate follow-up in cases of difficulty.

References

[Ocon10] B. O'Connor, R. Balasubramanyan, B. R. Routledge, N. A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. ICWSM 11, 2010.

[Zan11] E. Zangerle, W. Gassler, G. Specht. Recommending #-tags in Twitter. Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb 2011), CEUR Workshop Proceedings, vol. 730, 2011.

[Dom05] P. Domingos. Mining social networks for viral marketing. IEEE Intelligent Systems, 20(1), 2005.

[Sol10] B. Solis. The information divide between traditional and new media. 2010. Internet draft, March 16, 2014.

[Aie13] L. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 15(6), 1268-1282, 2013.

[Bif10] A. Bifet, E. Frank. Sentiment knowledge discovery in Twitter streaming data. Discovery Science. Springer, Berlin Heidelberg, 2010.

[Zha11] W. X. Zhao, J. Jiang, J. Weng, J. He, E. P. Lim, H. Yan, X. Li. Comparing Twitter and traditional media using topic models. Advances in Information Retrieval. Springer, Berlin Heidelberg, 2011.

[Hon10] L. Hong, B. D. Davison. Empirical study of topic modeling in Twitter. Proceedings of the First Workshop on Social Media Analytics. ACM, 2010.

[Bol11] J. Bollen, H. Mao, A. Pepe. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. ICWSM, 2011.

[Goy13] P. Goyal, N. Mehala, A. Bansal. A robust approach for finding conceptually related queries using feature selection and tripartite graph structure. Journal of Information Science, 39(5), 2013.

[Phu10] S. Phuvipadawat, T. Murata. Breaking news detection and tracking in Twitter. 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3. IEEE, 2010.

[San09] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, J. Sperling. TwitterStand: news in tweets. Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2009.

[Ban00] M. Banko, V. O. Mittal, M. J. Witbrock. Headline generation based on statistical translation. Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2000.

[Jin01] R. Jin, A. G. Hauptmann. Headline Generation Using a Training Corpus. Computational Linguistics and Intelligent Text Processing. Springer, Berlin Heidelberg, 2001.

[Yaj12] Y. Duan, Z. Chen, F. Wei, M. Zhou, H.-Y. Shum. Twitter topic summarization by ranking tweets using social influence and content quality. Proceedings of the 24th International Conference on Computational Linguistics, 2012.

[Hav03] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4), 2003.

[Shu11] K. Shubhankar, A. P. Singh, V. Pudi. An efficient algorithm for topic ranking and modeling topic evolution. Database and Expert Systems Applications. Springer, Berlin Heidelberg, 2011.

[Pap14] S. Papadopoulos, D. Corney, L. Aiello. SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. Proceedings of the SNOW 2014 Data Challenge, 2014.

[Hut13] C. J. Hutto, S. Yardi, E. Gilbert. A longitudinal study of follow predictors on Twitter. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2013.

[Oco10] B. O'Connor, M. Krieger, D. Ahn. TweetMotif: Exploratory Search and Topic Summarization for Twitter. Proceedings of the 4th Int'l AAAI Conference on Weblogs and Social Media, 2010.