=Paper=
{{Paper
|id=Vol-1150/sharma
|storemode=property
|title=TwiBiNG: A Bipartite News Generator Using Twitter
|pdfUrl=https://ceur-ws.org/Vol-1150/sharma.pdf
|volume=Vol-1150
|dblpUrl=https://dblp.org/rec/conf/www/SharmaBC14
}}
==TwiBiNG: A Bipartite News Generator Using Twitter==
TwiBiNG: A Bipartite News Generator Using Twitter
Yashvardhan Sharma Divyansh Bhatia
Department of Computer Science Department of Computer Science
Birla Institute of Technology & Science Birla Institute of Technology & Science
Pilani, India 333 031 Pilani, India 333 031
yash@pilani.bits-pilani.ac.in h2009399@pilani.bits-pilani.ac.in
Vivek Kishore Choudhary
Department of Computer Science
Birla Institute of Technology & Science
Pilani, India 333 031
f2012650@pilani.bits-pilani.ac.in
public. With the advent of Web 2.0 most of the jour-
nalism has gone the online way innovating the term
Abstract ”Online Journalism”. Since users of the web are ready
to share each and every activity they do in their lives
Online Journalism is being seen as future of due to the free nature of the world, this has made
Journalism. News Professionals are vying to professionals content hungry. Twitter generates an
capture newsworthy stories that emerge from amount of information that can outrun the storage
crowd. Live Social Media especially Twitter space of many servers in a few months. Developing
is generating enormous volumes of data every a user centered tool that can process this information
minute. It becomes difficult to select credi- in real time has become need of the day for professional
ble and relevant tweets that may form quality journalists.
news among others. The problem intensifies From the Arab Spring to the Oscars 2014 Selfie
due to the freedom of Twitter being an infor- tweets have changed the way the world shares infor-
mal language. Generating headlines by solv- mation. Scholars today can predict election results
ing this problem may still not be relevant and better than ever before [Ocon10]. The ”#” Hashtag
may face the question of authenticity. Given a feature in Twitter has made event stories easier to cap-
set of keywords and a time period this problem ture [Zan11]. As a result social network mining, orig-
becomes manageable and can be solved effi- inally loaded with clustering and classification of on-
ciently. We propose a bipartite algorithm that line worlds, is leveraging on understanding evolution
clusters authentic tweets based on key phrases of real-world events [Dom05].Adding another feather
and ranks the clusters based on trends in each to its cap is the fact that newspaper and magazines
timeslot. Finally, we present an approach to have started publishing content on social media sites
select those topics which have sufficient con- like Twitter and Facebook. To summarize, news no
tent to form a story longer breaks it tweets (Solis)[Sol10].
The goal of this paper is to demonstrate the use
1 Introduction of Twitter to monitor headlines online and generate
news stories. We propose a standalone system TwiB-
Journalism is the state of art that disseminates infor- iNG to extract tweets related to user defined keywords
mation and provides analysis of news to the general and propose ranked news summaries based on trend
Copyright © by the paper’s authors. Copying permitted only
and relevance of tweets they contain. The key novelty
behind TwiBiNG is generation of Bi-partitite clusters
for private and academic purposes.
In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedings
of tweet intentions and use of Longest common-sub-
of the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014, sequence(LCS) algorithm along with a few tweet cre-
published at http://ceur-ws.org ator’s details to separate relevant tweets from irrele-
vant ones. This approach not only produces better characters) and informal text are some issues that pose
clusters but also generates stories that are authentic, problems to many text mining researchers (Hong and
contains less spam and more importantly are distinct Davison [Hon10]). Bollen et al. [Bol11] used terms ex-
from each other. Also since we base our approach pressing positive and negative behavior for sentiment
on intention of tweets it makes it language indepen- analysis on Twitter.
dent. Readers should note that by intention we refer Text Clustering is another where scholars have
to the general subject of tweet; not the intention of worked for content analysis. Goyal and Mehala
the user posting it. The selected datasets were de- [Goy13] presented an approach to find conceptually re-
veloped from tweets collected between Tue 25 Feb, lated queries by clustering on bipartite and tripartite
18:00 GMT and Wed 26 Feb, 18:00 GMT based on graphs. We try to propose a similar approach for Twit-
keywords ”Syria”,”Ukraine”,”Terror”,”Bitcoin”. We ter content analysis using Bipartite graph. [Aie13] pro-
collected 1,041,062 unique tweets from 556,295 users poses trend based tweet clustering approaches. We
which included 648,651 retweets and 135,141 replies. present an approach that uses a modified BNgram
The crawl also included messages sent from or to a set clustering approach, which has motivation from orig-
of around 5000 journalists/commentators. inal approach of [Aie13]. Phuvipadawat and Murata
In short our contributions can be summarized as: [Phu10] present a breaking news prediction algorithm
that clusters tweets based on First Story detection af-
We incorporated retweets in BNgrams clustering ter segmenting different stories. TwitterStand [San09]
[Aie13] and hence improved upon the trend rank- develops a ”leader-follower” text clustering algorithm.
ing of keywords.
2.2 Natural Language Processing
We clustered our tweets based on bipartitite
graph thereby clubbing similar intention tweets Headline Generation has been active area of research
together. among NLP researchers. Most of the scholars work
here by selecting a proper set of keywords and finding
We reduced the effect of informal text in Twitter a way to combine them in a way that forms a gram-
by using LCS based similarity score while dealing matically coherent and meaningful sentence. In Banko
with keywords. et al.[Ban00] authors present a statistical approach to
We presented news headlines by ranking clustered term selection and term ordering process that depicts
tweets based on relevance to the clustered key- the power of non-extractive summarization whereas
word set and use ‘Part Of Speech’ tagger to make Jin and Hauptman [Jin01] presents an approach for
them readable. extractive summarization along with a Bayesian ap-
proach. They also discuss various issues in keyword se-
The remainder of the paper is organized as follows: In lection for headline generation. We use Part of speech
Section 2 we take a look at existing algorithms and ap- tagging along with most relevant tweet identification
proaches.Section 3 details about proposed methodolo- to generate meaningful user readable headline.
gies and approaches. Section 4 provides a discussion
of results. Section 5 concludes the work by laying a 3 Methodology
foundation for future work.
We divide our process in four phases 1) Data prepara-
tion, 2) Data Clustering 3) Cluster Ranking, 4) Tweet
2 Related Work
Ranking and Headline generation. We will now de-
The work of generating headlines using social media scribe our TwiBiNG system phase by phase:
can be seen as a combination of two branches 1) Infor-
mation Retrieval and Text Mining and 2) Natural Lan- 3.1 Data Preparation
guage Processing. Scholars have worked extensively on Once the data set for a given timeslot is ready by ex-
Twitter data using both the fields. Here we present an tracting tweets related to a given set of seeds and key-
overview of existing approaches in both fields: words, we tag entities in tweets using Stanford’s Part-
of-Speech Tagger and extract nouns, HashTags, Users.
2.1 Text Mining on Twitter Content
We ignore other parts of speech, thereby concentrating
Twitter has its own conventions for language while more on the subject than the predicate. This is be-
(@) is used to mention user, (#) is used to identify cause in a given timeslot, it is difficult for predicate to
events and ”RT” is used to represent a retweet. Bifet change rapidly for the same subject while the reverse
and Frank [Bif10] use these features for opinion min- may not be true. These tagged words are referred as
ing. Zhao et al.[Zha11] develop a Twitter-LDA model key phrases (KP) from now on. We now decide on
through content analysis. The restricted length (140 trending keywords.
We rank keywords using a modified df-idft [Aie13] If any LCS(Si , Ui ) contains Ui then we include all the
score by incorporating retweets: tweets related to Si in set < T Ui > which contains
Ri −Ri−1 tweet ids related to user centered keywords. We scan
R(ki ) = max(Ri ,Ri−1 ) the database for the timeslot again and remove those
Score(ki ) = ti ∗ log(1 + R(ki +1) tweets which are not contained in < T Ui > (user-
ti−1 +1 )
centric tweets). At the end of this stage we end up
Here Ri represents number of retweets for keyword k with a set of tweets and related keywords that can be
in timeslot i and ti represents number of tweets for considered authentic for a news story.
keyword k. Since a keyword may be related to un-
bounded number of tweets and retweets in a timeslot 3.2 Intention based Tweet Clustering
deciding on threshold is difficult. Therefore, we de- We use the approach used in [Goy13] to use bipartite
cided to normalize the score for each keyword using clustering of tweets. The basic aim here is to get real
min-max normalization. Let < K > be the set of intention of tweets in clusters. Algorithm 1 presents
tweets in a slot i then normalized score is given by: an incremental bipartite algorithm to cluster tweets
N ormalizedScore(N Ki ) = and keywords. Once we have a set of clusters we know
the intention of tweets. As can be seen the threshold
Score(ki ) − min(Score(< K >))
is kept > 0.5, which signifies that keywords merged
max(Score(< K >)) − min(Score(< K >)) should have an intention similarity of more than 50%.
The threshold for these normalized keywords was de- Readers requiring more specific tweets to be clustered
cided to be 0.0075 through experiments. We select the together may increase the similarity but this comes at
keywords above this threshold and store them in a set a cost of duplicate tweets being merged together. As
(Si ). We observed that for each timeslot at this thresh- can be observed in Algorithm 1, since the clustering
old we get around 800-875 trending keywords. Once is on basis of basis of Intersection(Ti ,Tj ) there will be
this set was ready we assigned tweets to each keyword, duplicate tweets in cluster but a news story contain-
i.e. we reversed the bipartite graph of Figure 1. We ing a lot of duplicate tweets would be considered of
now filter the tweets based on user details specifically poor quality. So removing duplicate content becomes
number of followers and status counts. This step is a prime task now.
necessary in order to increase authenticity and reduce Data: I< Si , < T Si >> Si and T Si denotes a set
tweets containing spamming content. Since clustering of keywords and related tweets
is based on tweet intention, not performing the previ- Result: O< CSi , < CT Si >> clustered set of
ous step may hamper clustering performance. Also the tweets
generated stories may not be considered quality news. Let S: represent set of unique keywords
Our experiments based on (Hutto et. al. [Hut13]) de- while clusters exist with similarity > threshold do
cided that users with a follower count>600 and tweet flag=0;
count>6000 may be considered authentic and consid- while si in S do
ering tweets by these users alone will significantly im- j=i+1;
prove system performance. while tj in T do
Sim(si ,sj )
=Intersection(T si ,T sj )/Union(T si ,T sj );
if Sim (si , sj ) > 0.5 then
I< si , < T si >> =
I< si = sj , < U nion(T si , T sj ) >>
Remove sj from I flag=1;
end
end
if flag=0 then
b
end
Now since we are building a user centered news gen- reak;
erator we want tweets related to the keywords defined end
end
by user to improve relevancy. For this purpose we scan
all keywords in (Si ) and compute their Similarity with Algorithm 1: Bipartite Clustering of Tweets using
user-defined keywords (Ui ). Keywords
LCS(Si , Ui ) = LongestCommonSubsequence(Si , Ui ) In Algorithm 2 we present an algorithm to remove du-
plicate tweets from cluster: has its Motivation from BNgram clustering approach
Data: < CSi , < CT Si >> Set of tweets in a cluster used in [Aie13]. Readers can think of T CSi as a boost
of keywords CSi factor for relevance.
Result: : < CSi , < F T Si >> Final Set of tweets
ClusterScore(CScri ) = RCSi ∗ T CSi
and clusters
while csi in CSi do
We now rank the clusters based on (CScri ). At the
while ti in CT Si do
end of this phase we have ranked our clusters and to
j=i+1
avoid any confusion further we now refer them as <
if < Di >.contains< tj > = false then
CSir , < F T Sir >>.
while tj in CT Sj do
sim(ti , tj )= 3.4 Tweet Ranking in Clusters
LCS(ti , tj )/Min(ti .length,tj .length)
if sim(ti , tj ) > 0.65 then Now once clusters are ranked we need to rank tweets
< Di >.add(tj ); contained in them in order to present them in most rel-
end evant order. Before introducing ranking calculations
end we need to introduce expanded keyword set. This can
end be seen as a prerequisite in the step of headline forma-
end
< F T Si > = < CT Si >-< Di >; tion. This step is necessary and relevant since some of
< CSi , < CT Si >> =< CSi , < F T Si >> the clusters may contain a small number of keywords
end and need sufficient information to generate a story. We
represent the expanded cluster set as < ECSi > . Let
Algorithm 2: To remove Duplicate Tweets from Clus- set < Kt > represent set of keywords for tweet Ti .
ter Then relevance score for Ti is calculated as
The motivation behind threshold of 0.65 in Algo- Intersection(< Kt >, < ECSi >)
rithm 2 can be observed in O’Connor [Oco10]. We Score(T i) =
U nion(< Kt >, < ECSi >)
end this phase with a cluster of keywords and their
relevant set of tweets. So now we know the intention Now we rank our tweets based on Score(Ti ). At the
of our keywords and we are ready to rank them. end of this phase, we filter out tweets which have a
score(Ti ) ¡ 0.3. The threshold 0.3 is based on the
3.3 Cluster Ranking results of our experiments, as described in Table 2.
Up until this phase we have obtained required set of Increasing the threshold provides better quality sto-
clusters. We now need to rank them. Although differ- ries but reduces the number of stories at a high rate.
ent authors [Yaj12][Hav03][Shu11] have proposed effi- Hence, readers requiring more focused stories may in-
cient topic ranking methods they have a common fea- crease the threshold.
ture that relevance to considered keywords is consid-
ered an important issue. We make use of this fact and 3.5 Cluster Selection and Headline Genera-
of normalized trend score to generate a ranking score tion
for clusters. Since we are vying for a user centric tool In this phase we provide an approach to decide which
our clusters should be most relevant to their inten- clusters can form news. As can be observed not all
tion. Also since we have to generate headlines trend clusters form a story, we must judiciously decide on
needs a special attention. Keeping the above two facts clusters to form news. By experiments, we observed
we present our cluster ranking methodology. Using the following Heuristic may be used to select quality
< Ui > we collected tweets for relevant keywords in clusters: H3.5.1: Those clusters tend to form quality
section 3.1 as set < T Ui >. We calculate Relevancy of stories which contain at least four keywords, one Hash-
cluster CSi having tweets < F Si > as: tag keyword, and is related to at least three tweets
.Further , number of non Hashtag keywords should be
RCSi = Relevancy(CSi ) = M ax(Intersection(U
U nion(Ui ,F Si )
i ,F Si )
more than Hashtag keywords.
The rationale behind this approach can be ex-
This relevancy score gives us an indication about the
plained. The clusters having excessive amounts of
relation of cluster to the user’s intention.
hashtags as keywords are usually related to tweets with
T CSi = T rend(CSi ) = e−M ax(N ormalizedScoreof CSi) almost similar content. Having a hashtag allows users
to easily identify events and more than three distinct
This factor indicates that how much a cluster is trend- tweets allows us to form a sequence of events. Since,
ing. The idea of taking Max(Normalized Score of CSi ) we are needed to identify a fixed number of topics, we
follow H3.5.1 and scan all the clusters in < Csir > up are covered, but only the most relevant are shown for
until the specified number of clusters in each timeslot. clarity.These results show an improved performance
Hence, we follow a dynamic approach that is indepen- over previously existing systems. A limitation of this
dent of cluster count. system is not including user’s community which may
For Headline Generation we order the keywords in have allowed us to form tripartite clustering, thereby
accordance to top ranked tweet in cluster and use POS improving clustering quality at a low cost. Use of bet-
tagger to connect the keywords. We believe that better ter known String matching algorithms may improve
approaches to form headlines exist, but we were deal- cluster quality. Our use of bipartite clustering algo-
ing with informal language so we need to take support rithm can allow future researchers to explore more into
from tweet intent to form them. Readers may improve this field.
upon this aspect by considering statistical techniques
mentioned in section 2.2. 5 Acknowledgement
Authors owe a debt of gratitude to Dr. P. Goyal and
4 Results and discussion Dr. N. Mehala for their constructive criticism and
innovative ideas that formed the foundation of this
Table 1 depicts human evaluation of results as car- study. We would like to extend special thanks Birla
ried out by authors. The official evaluation results Institute of Technology and Science for providing re-
of our method in the Data Challenge are included sources without which this work would never have been
completed. We would like to thank SNOW’14 orga-
in snow2014dc [Pap14]. The language content shows nizers for giving us a chance to work on social sensor
that our topics were evenly distributed between En- project and for their immediate follow up in cases of
glish and non-English tweets. This is probably due difficulty.
to selection of keywords related to Syria and Ukraine,
which allowed foreign phrases to come in the dataset. References
News Headline Readability being a highly subjective [Ocon10] O’Connor, B., Balasubramanyan, R., Routledge, B.
attribute, needs to be evaluated manually. A News R., & Smith, N. A. (2010), From tweets to polls:
Headline is considered readable if majority of the users Linking text sentiment to public opinion time series,
accessing the system can comprehend it without the ICWSM,11,122-129 .
use of other resources. Further, it can be observed [Zan11] Zangerle, E., Gassler, W., & Specht, G. (2011).
that 81.60% of our topics were labeled readable by Recommending#-tags in Twitter. In Proceedings of the
Workshop on Semantic Adaptive Social Web (SASWeb
language experts. The images related to the extracted 2011). CEUR Workshop Proceedings (Vol. 730, pp. 67-78).
tweets were found to symbolize the news story with
[Dom05] Domingos, P. (2005). Mining social networks for viral
97.67% accuracy. marketing. IEEE Intelligent Systems, 20(1), 80-82.
Table 2 represents the number of topical clusters
[Sol10] Solis, B. (2010). The information di-
with increasing score(Ti) threshold. As can be ob- vide between traditional and new media,
served, number of clusters decrease at a high rate with http://www.briansolis.com/2010/02/the-information-
respect to the threshold value. Thereby, allowing us divide-the-socialization-of-news-and-dissemination/,
Internet Draft (last accessed March 16, 2014)
to select 0.3 as our base threshold.
[Aie13] Aiello, L., Petkos, G., Martin, C., Corney, D., Pa-
padopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I.,
Table 1: Human Evaluation of topics Jaimes, A. (2013) Sensing trending topics in Twitter. Mul-
timedia, IEEE Transactions on 15(6) 2681282.
English 282
Language [Bif10] Bifet, A., & Frank, E. (2010).Sentiment knowledge dis-
Non-English 256
covery in Twitter streaming data. In Discovery Science (pp.
Good 439 1-15). Springer Berlin Heidelberg.
News Headline Readability
Bad 99
[Zha11] Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P.,
Related 84 Yan, H., & Li, X. (2011).Comparing Twitter and tradi-
Topics with images
Unrelated 2 tional media using topic models. In Advances in Informa-
tion Retrieval (pp. 338-349). Springer Berlin Heidelberg.
[Hon10] Hong, L., & Davison, B. D. (2010).Empirical study
Table 2: Number of clusters v/s Score(Ti) Threshold of topic modeling in Twitter. In Proceedings of the First
Workshop on Social Media Analytics (pp. 80-88). ACM.
Threshold 0.25 0.30 0.35 0.40
No. of [Bol11] Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public
754 538 467 261 mood and emotion: Twitter sentiment and socio-economic
Clusters phenomena. In ICWSM.
[Goy13] Goyal, P., Mehala, N., & Bansal, A. (2013). A robust
Table 3 represents sample topics along with Headline, approach for finding conceptually related queries using fea-
timestamp, related tweets and set of keywords. The ture selection and tripartite graph structure. Journal of In-
readers should note that not all the tweets in the story formation Science, 39(5), 575-592.
[Phu10] Phuvipadawat, S., & Murata, T. (2010). Breaking [Hav03] Haveliwala, T. H. (2003). Topic-sensitive pagerank: A
news detection and tracking in Twitter. In Web Intelli- context-sensitive ranking algorithm for web search. Knowl-
gence and Intelligent Agent Technology (WI-IAT), 2010 edge and Data Engineering, IEEE Transactions on, 15(4),
IEEE/WIC/ACM International Conference on (Vol. 3, pp. 784-796.
120-123). IEEE. [Shu11] Shubhankar, K., Singh, A. P., & Pudi, V. (2011). An
[San09] Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieber- efficient algorithm for topic ranking and modeling topic
man, M. D., & Sperling, J. (2009). Twitterstand: news in evolution. In Database and Expert Systems Applications
tweets. In Proceedings of the 17th ACM SIGSPATIAL In- (pp. 320-330). Springer Berlin Heidelberg.
ternational Conference on Advances in Geographic Infor-
mation Systems (pp. 42-51). ACM. [Pap14] Papadopoulos S., Corney D., Aiello L. (2014). SNOW
2014 Data Challenge: Assessing the Performance of News
[Ban00] Banko, M., Mittal, V. O., & Witbrock, M. J. (2000).
Topic Detection Methods in Social Media. In Proceedings
Headline generation based on statistical translation. In
of the SNOW 2014 Data Challenge.
Proceedings of the 38th Annual Meeting on Association
for Computational Linguistics (pp. 318-325). Association [Hut13] Hutto, C. J., Yardi, S., & Gilbert, E. (2013). A longitu-
for Computational Linguistics. dinal study of follow predictors on twitter. In Proceedings
[Jin01] Jin, R., & Hauptmann, A. G. (2001). Generation Us- of the SIGCHI Conference on Human Factors in Comput-
ing a Training Corpus. In Computational Linguistics and ing Systems (pp. 821-830). ACM.
Intelligent Text Processing (pp. 208-215). Springer Berlin [Oco10] O’Connor, B., Krieger, M., & Ahn, D. (2010). Tweet-
Heidelberg. Motif: Exploratory Search and Topic Summarization for
[Yaj12] YaJuan, D. U. A. N., WEIF uRu, C. Z., Heung, Z. M., Twitter. In Proceedings of the 4th Int’l AAAI Conference
& Shum Y. (2012). Twitter topic summarization by rank- on Weblogs and Social Media.
ing tweets using social influence and content quality. In
Proceedings of the 24th International Conference on Com-
putational Linguistics (pp. 763-780).
Table 3: Sample Headlines and Stories
HEADLINE KEYWORDS TWEETS
1) #Syria #Homs
#Aleppo Leader
of Syrian
Syria alQaeda Rivals, militant group
leader gives alQaeda, challenges rivals
rivals ultimatum. #Syria, 2) RT: Top
group, al-qaeda leader
(25-02-14 19:00) ultimatum abu khalid al-
Suri was reportedly
killed by a
rival.#Syria
#ukraine Rada
says try
Yanukovich
before Int Crime
Court. Should be
Ukraine
tried by
parliament
Yanukovich, Ukrainians for
wants
Ukraine, crimes against
Yanukovich
parliament, Ukrainians!
tried international
court 2) Yanukovich
court
papers:Snipers who
(25-02-14 18:45)
killed dozens of
protesters came
from Ukraine’s
”omega” special
forces.#euromaiden
Russian President 1) Putin orders
Vladimir Putin troops to prepare
ordered test in case of ’a crisis’
combat readiness Vladimir, in Ukraine as
for troops Putin, tensions step up.
stationed region Yanukovych, Report on The
that touches Russia, 530 now @tv3News
Ukraines Ukraine 2) Russia puts
northern troops on
border alert amid
(26-02-14 17:30) Ukraine tension.
Ukraine leaders Not in my
riot,
disband riot wildest dreams I’d
Ukraine,
police who imagine Arab police
police,
kneel down doing so
unit,
ask forgive- #Ukraine riot
crackdown,
ness from police asking
Kiev,
the people forgiveness
protesters
(26-02-14 17:45) from protesters
The equivalent of
Bitcoin turmoil
war when states are
rumoured 375m theft,
in danger.
theft closes Gox,
major Bitcoin,
Bitcoin
exchange. exchange
exchange fears
(26-02-14 03:30)
$400m theft #bitcoin
bitcoin major
# Business?Exchange
turning point
time, closes as virtual
mtgox exchange
website, money vanishes:
abruptly stops
transactions, ANGRY Bitcoin
trading 774000
being, enthusiasts are
bitcoin reported
Bitcoin protesting the
missing
collapse
(26-02-14 03:45)