=Paper= {{Paper |id=Vol-2482/paper12 |storemode=property |title=Predicting Tomorrow’s Headline using Twitter Deliberations |pdfUrl=https://ceur-ws.org/Vol-2482/paper12.pdf |volume=Vol-2482 |authors=Roshni Chakraborty,Abhijeet Kharat,Apalak Khatua,Sourav Kumar Dandapat,Joydeep Chandra |dblpUrl=https://dblp.org/rec/conf/cikm/ChakrabortyKKDC18 }} ==Predicting Tomorrow’s Headline using Twitter Deliberations== https://ceur-ws.org/Vol-2482/paper12.pdf
            Predicting Tomorrow’s Headline using Twitter
                           Deliberations

            Roshni Chakraborty                            Abhijeet Kharat                 Apalak Khatua
                 IIT Patna                                   IIT Patna                   XLRI Jamshedpur
                    India                                      India                           India
           roshni.pcs15@iitp.ac.in                   abhijeet.mtcs17@iitp.ac.in          apalak@xlri.ac.in

                               Sourav Kumar Dandapat                    Joydeep Chandra
                                      IIT Patna,                            IIT Patna,
                                         India                                 India
                                  sourav@iitp.ac.in                     joydeep@iitp.ac.in



Abstract

Predicting the popularity of a news article is a challenging task. Existing literature has mostly focused on article contents and polarity to predict popularity. However, existing research has not considered users' preference towards a particular article. Understanding users' preference is an important aspect of predicting the popularity of news articles. Hence, we consider social media data from the Twitter platform to address this research gap. In our proposed model, we consider users' involvement as well as users' reaction towards an article to predict the popularity of the article. In short, we predict tomorrow's headline by probing today's Twitter discussion. We have considered 300 political news articles from the New York Post, and our proposed approach has outperformed other baseline models.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    Introduction

Gone are the days when an office-going New Yorker used to board the subway with a folded newspaper in his hand. Reading morning newspapers on New York's subway is becoming outdated. Things have changed drastically in recent times. Today's millennial generation is not only emotionally but also physically tied to their smartphones and tablets. This has severely affected the newspaper industry. All leading newspapers across the globe have reported a sharp drop in their print circulation. So, the future of this industry lies on the digital platform. The competition in the newspaper industry is no longer about sending the print version to the remotest corner of the country. The challenge of the digital platform is to understand the latent psychological aspects of the users. If a newspaper fails to satisfy a user, then within the next few seconds she will switch to another news-related app. This will directly impact the ad revenue of a news outlet. Between various news-related apps and various social media platforms, users these days are spoilt for choice. Customer loyalty is a concept of a bygone era in this digital age, and customers' preferences are also not homogeneous. In brief, the phenomenal growth of online news consumption and innumerable news sources has significantly increased the competition among news media outlets. Further, the continuous influx of newsworthy events aggravates the situation. Thus, for media outlets, the need of the hour is to develop an automated system that can help them to predict which of today's headlines will maintain its popularity tomorrow.

Existing literature has attempted to address this. However, this stream of research mostly explored various features and contents of the articles and the title of the articles [FVC15, LWZ+17]. Prior studies considered the subjectivity and polarity of contents [FVC15, KWHR16], the sentiment of the headline [RBdM+15], and so on. In other words, these works focus broadly on the articulation aspects of an article. A few studies also considered the importance of an event to predict the popularity of a news article [SAMA17]. One of the major shortcomings of the above approach is that, hypothetically, two articles might have similar features and polarity, but the reaction of readers might be different. It would be similar to comparing an apple to an orange even though they might be nearly similar in shape and weight. We argue that probing the social media platform can hint which is the orange and which is the apple. Social media platforms can hint at users' preferences.

Nowadays social media platforms, such as Twitter, generate an enormous amount of user-generated data, and many times these social media platforms become the mirror of society. Existing literature has successfully explored Twitter data to predict election outcomes [KKGC15], to understand social movements [KK16a], and to tackle disasters and epidemic outbreaks [KK16b]. Therefore, we argue that understanding the finer nuances of Twitter deliberation can be beneficial for predicting the popularity of news articles on digital platforms. This paper attempts to address this research gap.

Prior studies noted that analyzing social media platforms could shed light on the popularity [KHGPS16] and the life cycle of various news articles [Cas13]. However, these studies have considered only tweets that mention the URL of a news article. In effect, these studies failed to probe the richness of the Twitter platform by restricting themselves to a very small sub-sample of tweets with the URL of a specific news article. In contrast, our study takes a more holistic approach and considers both users' involvement with a news article and users' reaction towards a specific news article. However, the biggest challenge for this approach is to identify the relevant tweets for a specific news article. Therefore, we have developed an iterative and adaptive algorithm that considers both textual and semantic attributes to identify the relevant tweets. Our user involvement aspect considers various count measures, such as the total number of tweets, the average number of retweets, the count of hashtags, the cumulative number of unique users as well as influential users, and so on. Also, our user reaction indices consider linguistic aspects of the Twitter discussion, such as variances in sentiment and emotion for a particular news article. For the sake of robustness, we have considered various machine learning algorithms and an exhaustive set of baseline models based on prior studies. Our findings on the basis of 300 news items strongly suggest that patterns of Twitter deliberations can outperform other baseline models in predicting the popularity of news articles.

2    Related Works

Predicting the popularity of news articles is a well-researched area. However, the genesis of this research lies in the prior works on news recommender systems. News recommender system research mostly focused on the personal preference of an individual user [LXG+14]. So, understanding the user-level latent political leaning, or bias towards a certain sport or a team, can help to predict the suitable news article for an individual user. For instance, prior studies considered users' historical preferences [WLC+10], social network data [DFMGL12, AGHT11], user feedback [SBZ11] or a combination of both user preferences and user feedback [LWL+11, LCLS10, MGÁRLGMM13], but this stream of studies struggled due to a lack of adequate data. Moreover, users' interests can vary over time, and historical data is not available for new users. Thus, prior studies have attempted to mitigate these challenges by considering opinions of social influencers [LXG+14], topic or temporal [XXLZ14] relationships between news items and users [LL13] or analyzing user communities [ZLHL13]. These approaches yield better results in comparison to initial studies, but still, the accuracy of filtering news articles for a newspaper is not satisfactory.

In comparison to a news recommender system for an individual user, predicting the popularity of a news article is a complex task because an efficient prediction model needs to account for the heterogeneity of users. More importantly, the summation of individual user preferences would not be a proxy for societal acceptance. For instance, it is easy to predict what a Democrat or Republican will prefer to read on the digital platform, but the task becomes complex if we try to predict what political news will engage both Democrats and Republicans. So, a dominant stream of prior works focused on the content of the news article and article headline and employed machine learning algorithms [SAM+16] to predict the popularity of a news article. For instance, existing literature considered different features of an article [VCLDD17, KWHR16], such as textual [FVC15] and temporal features, to predict its popularity. Prior studies noted that article content features, such as the length of the article, the time of publishing, category or genre of the article, the author of the article and so on, can predict the popularity of a news article [LWZ+17]. Another set of studies also considered the linguistic attributes of an article [KMJO16, KYS+17] or the presence of important entities within an article [SS16] to investigate the issue. Existing literature also noted that the headline of an article itself [KFKN15] and the polarity within the headline [RBdM+15] could be important input variables for predicting the popularity of a
news article.

Another set of works highlighted event importance to predict the popularity of a news article. For instance, Setty et al. [SAMA17] ranked news articles by linking them to a chain of recent news events. Similarly, other studies tried to explore event importance by combining articles using topic similarity from Wikipedia [MB16] or by considering causal relationships [KVW14]. However, this approach has limitations for new upcoming events or for an event that is losing relevance among readers. In these scenarios, we argue that probing users' behavioral patterns on social media platforms can hint at the popularity of a news article.

To the best of our knowledge, there is hardly any study that has considered social media platforms for predicting the popularity of news articles. Some of the prior studies considered users' behavior on a news media outlet and argued that the engagement of users could predict the popularity of a news article [TADAF14, TLA+11]. However, it is worth noting that a news media outlet represents a minuscule fraction of digital platform readers. This is one potential research gap in the existing literature. Popular social media platforms, such as Twitter, not only provide users an option to share their views but also allow them to reply to or endorse the views of others by retweeting. In other words, Twitter provides a platform for its users to engage in a deliberation. Interaction of users on the Twitter platform can shed light on users' engagement with a particular news article [OCDA15]. Thus, this paper attempts to predict the popularity of news articles using Twitter data. Some of the earlier works consider initial Twitter reactions [MTR14, CEHPS14] or content and structural features [LZZ15]. However, as we mentioned, they have only considered tweets that contain news-related URLs. Thus, none of the prior studies considered the richness of Twitter data. So, this paper not only considers user-level involvement (by using count measures of tweets, retweets, number of unique users and other parameters) but also probes user-level reaction towards a particular news article (by understanding the various linguistic aspects of their tweets).

3    Data Collection

To address our research problem, we have considered the front-page political news of the New York Post, which is one of the most popular newspapers in the United States. It has experienced a whopping 500% growth in the last five years with 331 million page views in March 2018. Predicting the popularity of political news is the most challenging in comparison to other genres of news. For instance, a news article on climate change will uniformly affect all users. Therefore, it is easy to predict the reaction of users. However, political news might not uniformly engage and affect all users because of their ideological heterogeneity. In this study, we have considered 300 political news articles from the New York Post during the period July 2016 to September 2016.

We have considered the Twitter platform for collecting the social media data. Twitter allows free access to approximately 1% of total tweets (in a random fashion) using the streaming API. To probe our research question, we have considered the tweets related to a particular news article. Extracting the relevant tweets for a particular political news article is a challenging task. To address this, we have developed an adaptive algorithm that considers both content (similar keyword mapping) features and context (same hashtag) features of tweets to extract the tweets related to a news article. Following prior studies [CBDC17], as an initial step, we have considered a set of preliminary hashtags that have a threshold keyword overlap with the representative (top 10 by TF-IDF) keywords within the news article. In other words, these preliminary hashtags, which we have initially considered to crawl tweets for a particular news article, are a bag of hashtags on the basis of the article's seed tweets [CBDC17]. However, there are certain limitations to this approach because users can use multiple hashtags for a particular news article on the social media platform, but not all of them might be unique to that particular news article. For instance, social media users have used multiple hashtags for the news article titled GOP blasts Obama's 400 million dollars secret ransom paid to Iran, such as #whitehouse, #trump2016, #chicago, #irandeal, #obamabetrayus and so on (as shown in Table 1). The last two hashtags are more specific about the news article in comparison to the others. Hence, we need to consider this in our data collection as well as analysis.

To address the above concern, we have collected all hashtags related to all political news articles published in the previous one month (with respect to the publication date of the article we are considering in our analysis). This process has generated a bag of hashtags. From this bag of hashtags, we have identified a set of hashtags that were frequently used by Twitter users and labeled them as generic hashtags. Consequently, we have labeled #whitehouse, #trump2016 and #chicago as generic hashtags for the above article because these hashtags were also used by Twitter users for other issues and news articles.

Next, we have developed an automated system for identifying hashtags specific to a news article. We have selected as preliminary hashtags those hashtags that were mentioned in a tweet T and fulfilled the threshold criteria of keyword matching with the news article [CBDC17]. We define article-specific hashtags as those hashtags that were mentioned in T but not in our list of generic hashtags. For instance, for article 1 (as shown in Table 1), we have labeled #irandeal and #obamabetrayus as article-specific hashtags. Similarly, the article-specific hashtag for news article 4 (as shown in Table 1), i.e., Obamacare hikes has families struggling to afford insurance, was #repealobamacare.

To check the accuracy of this approach, we have provided around 130 news articles along with all the hashtags to three annotators. We have labeled a hashtag as an article-specific hashtag if the majority of annotators have marked that particular hashtag as specific to that article, and otherwise labeled it as a generic hashtag. We observed that our proposed approach yields an accuracy of 89% in identifying article-specific hashtags. Table 1 reports a few sample news articles and the corresponding article-specific (HA) and generic (HG) hashtags. After identifying the article-specific hashtags, we use these hashtags to extract further tweets related to that news article.

Table 1: Titles of 4 news articles, sample tweets related to each news article, and the related hashtags, both generic (HG) and article-specific (HA).

SNo 1
  News article:  GOP blasts Obama’s 400 million dollars secret ransom paid to Iran
  Sample tweets: 1. It’s not like Obama ever earned any money ... he gave 400 million in cash to Iran.... #irandeal
                 2. I strongly oppose the Raskin-supported foreign policy toward #iran. we must not pay ransom to a dangerous terror regime.
  Hashtags:      #whitehouse (HG), #irandeal (HA), #pressecretary (HG), #obamabetrayus (HA), #trump2016 (HG), #chicago (HG), #cnn (HG)

SNo 2
  News article:  Melania Trump: I have never lived in the US illegally
  Sample tweets: 1. What visa enabled melania trump to work in the U.S.? #theplotthickens, #immigration, #trump
                 2. If Melania Trump broke immigration laws, the best punishment is ... #melaniaImmigration #nevertrump
  Hashtags:      #whitehouse (HG), #irandeal (HA), #pressecretary (HG), #obamabetrayus (HA), #trump2016 (HG), #chicago (HG), #cnn (HG)

SNo 3
  News article:  Hillary to blame for Iranian scientist’s hanging, general says
  Sample tweets: 1. Hillary Clinton reckless emails outed an Iranian nuclear scientist who was executed by Iran for treason #neverhillary
                 2. #crookedhillary server has emails discussing nuclear scientist #executed by iran #shortcircuit
  Hashtags:      #crookedhillary (HG), #neverhillary (HG), #shortcircuit (HG), #hillary (HG)

SNo 4
  News article:  Obamacare hikes has families struggling to afford insurance
  Sample tweets: 1. Shhh...You’re not supposed2 know about health insurance rate hikes until after elections! vote #trump #repealobamacare
                 2. #repealobamacare Obama will take our money
  Hashtags:      #repealobamacare (HA), #trump (HG), #maga (HG)

Next, we have extracted the user-level information. In other words, we have extracted the information related to users who had participated in the political discussion related to any of these 300 news articles. We have extracted the user names from our Twitter corpus and identified around 1 million unique users who have tweeted at least once for our sample of 300 news articles. Next, we have crawled their last 3200 tweets and profile-related information. In our model, we have considered whether a user is influential or not. If a user has more than 1000 followers, then we have considered them as an influential user. To sum up, we have considered 1.8 million tweets for our 300 news articles made by around 1 million unique users.

4    Proposed Approach

For predicting the popularity of a news article, we have considered two categories of social media data, namely user involvement indices and user reaction indices. We argue that the popularity of a news article among social media users (which is a proxy for digital platform readers) can be captured by analyzing the attention that a news article is receiving and the linguistic content of the discussion on the Twitter platform on the very day of its publication. In brief, the former category considers various user-level tweet statistics, and the latter employs natural language processing techniques to understand the linguistic aspects of the social media discussions. The following sections narrate how we have operationalized the involvement and reaction indices.
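The generic versus article-specific hashtag labeling described in Section 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_articles` cut-off and the toy `previous_month` bag of hashtags are assumptions made for illustration, since the paper only states that generic hashtags are those frequently used across other news articles.

```python
from collections import Counter


def label_generic_hashtags(hashtags_by_article, min_articles=2):
    """Return hashtags attached to at least `min_articles` different articles.

    `min_articles` is an assumed cut-off; the paper only says that generic
    hashtags are the frequently used ones.
    """
    article_counts = Counter()
    for tags in hashtags_by_article.values():
        # Count each hashtag once per article, however often it was tweeted.
        article_counts.update(set(tags))
    return {tag for tag, n in article_counts.items() if n >= min_articles}


def article_specific_hashtags(preliminary_hashtags, generic_hashtags):
    """Keep the preliminary hashtags that are not in the generic list."""
    return sorted(set(preliminary_hashtags) - set(generic_hashtags))


# Hypothetical bag of hashtags from the previous month's political articles.
previous_month = {
    "article_a": ["#whitehouse", "#trump2016", "#chicago"],
    "article_b": ["#whitehouse", "#trump2016", "#cnn"],
    "article_c": ["#chicago", "#cnn", "#maga"],
}
generic = label_generic_hashtags(previous_month)

# Preliminary hashtags of article 1, taken from the running example.
preliminary = ["#whitehouse", "#trump2016", "#chicago", "#irandeal", "#obamabetrayus"]
specific = article_specific_hashtags(preliminary, generic)
print(specific)
```

With this toy input, the three hashtags shared across articles are treated as generic and only #irandeal and #obamabetrayus remain article-specific, which matches the labeling reported for article 1.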
Table 2: Five news articles, whether each was published the next day (Pn+1), and the user involvement indices related to each article. HP, HG and HA represent the number of preliminary hashtags, generic hashtags and article-specific hashtags; Tt, Tr and Tf represent the number of tweets, retweets and favourites; u and qu represent the users and unique users of the tweets related to a news article.

SNo  Title of a few sample news articles on the nth day            Pn+1  (HP/HG/HA)  (Tt/Tr/Tf)        (u/qu)
1    Suicide bombing at Pakistani hospital kills at least 63       Yes   16/8/8      1319/42150/37150  1319/1240
2    Trump to propose big tax breaks in economic plan              Yes   20/12/8     208/44151/37929   208/180
3    Trump gives Post columnist a shout-out in economic speech     No    5/5/0       10/0/0            10/10
4    Obama commutes sentences for record-breaking 214 prisoners    No    13/1/12     13/2/0            13/13
5    Furious GOP leaders plot to get Trump on track                Yes   39/8/31     344/223/150       344/289
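The involvement indices summarized in Table 2 and defined in Section 4.1 could be computed from raw tweet records roughly as follows. This is a sketch under stated assumptions: the `Tweet` record, the `involvement_indices` function and the `max_tweets_per_day` parameter are hypothetical names, and the paper does not say how the maximum number of possible tweets per day is obtained. Only the 1000-follower threshold for influential users comes from the paper.

```python
from collections import Counter
from dataclasses import dataclass

INFLUENTIAL_FOLLOWERS = 1000  # threshold stated in the paper


@dataclass
class Tweet:  # hypothetical record layout
    user: str
    followers: int
    retweets: int
    favorites: int


def involvement_indices(tweets, max_tweets_per_day, article_hashtags):
    """Summarize tweet, user and hashtag statistics for one article."""
    per_user = Counter(t.user for t in tweets)
    unique_users = len(per_user)
    # Affected users: users who tweeted more than once about the article.
    affected = sum(1 for n in per_user.values() if n > 1)
    followers = {t.user: t.followers for t in tweets}
    influential = sum(1 for f in followers.values() if f > INFLUENTIAL_FOLLOWERS)
    return {
        # Assumed normalization: divide by a supplied daily maximum.
        "tweets_normalized": len(tweets) / max_tweets_per_day,
        "retweets": sum(t.retweets for t in tweets),
        "favorites": sum(t.favorites for t in tweets),
        "unique_users": unique_users,
        "affected_fraction": affected / unique_users if unique_users else 0.0,
        "influential_fraction": influential / unique_users if unique_users else 0.0,
        "article_specific_hashtags": len(set(article_hashtags)),
    }


# Toy input, invented for illustration only.
sample = [
    Tweet("a", 1500, 3, 1),
    Tweet("a", 1500, 0, 0),
    Tweet("b", 200, 1, 2),
    Tweet("c", 50, 0, 0),
]
indices = involvement_indices(sample, max_tweets_per_day=8,
                              article_hashtags=["#irandeal", "#obamabetrayus"])
print(indices)
```

Counting users through a `Counter` keeps the distinction the paper draws between ten tweets from ten users and ten tweets from a single user.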
4.1 User Involvement Indices                                   their personal/official twitter handle then immediately
                                                               that tweet will get retweeted by hundreds of their fol-
We capture the attention of Twitter users for a partic-        lowers. more than 1000 followers. To sum up, in addi-
ular news article by considering the user involvement          tion to generic user statistics we have also considered
through three aspects: tweet statistics, user statistics       the fraction of affected users and influential users for
and hashtag statistics. Under the tweet statistics cat-        each news article as an input variable in our model.
egory, we have considered the number of tweets, the               Next, we considered the number of article-specific
number of retweets and the number of favorites re-             hashtags on the Twitter platform as a metric to gauge
ceived by a particular news article on the very day            the involvement of social media users. Intuitively, it
of publication. These three statistics represents the          can be argued that higher user involvement with a
user response towards a particular article. We ob-             news article would generate higher number of article-
serve that there is a significant variance in user involve-    specific hashtags. For instance, the news article Trump
ment. Some news article receives hundreds of tweets            gives Post Columnist a shout-out in economic speech
and retweets whereas another news article merely re-           didnt generate article specific hashtags. On the con-
ceives ten to twenty tweets. Thus, we have normalized          trary, the news article Trump to propose big tax breaks
the number of tweets received by a particular article          in economic plan has created 8 article specific hashtags
by dividing it with the maximum number of possible             (as shown in Table 2). Thus, we have considered the
tweets that an article can have in a day.                      total number of article-specific hashtags as an input
   Intuitively, the number of users get involved with a        variable in our proposed model.
particular news article is a good predictor of the pop-
ularity of the news article. Furthermore, we note that
                                                               4.2    User Reaction Indices
some users get more involved with a particular news
article, and they tweet multiple times in a day. How-          As we mentioned earlier, natural language process-
ever, it is worth noting that 10 tweets from 10 different      ing (NLP) techniques allowed us to go beyond various
users, in comparison to 10 tweets from 1 particular            count-based user-level measures and to probe the lin-
user, is a better proxy to gauge the popularity of a           guistic content of Twitter deliberations to understand
news article. On the contrary, if a user tweets, about         the cognitive involvement of users with a particular
a particular news article, for more than once, then it         news article. This cognitive involvement of users can
also indicates his high involvement of that user with          be a good proxy to predict the popularity of a news
that particular news article. So, we have considered           article. So, we have employed NLP techniques, such
these finer variances in our analysis. We have consid-         as sentiment and emotion analysis, to gauge the user
ered the fraction of users, who have tweeted more than         reactions towards a particular news article. We have
once for a particular news article, as affected users.         considered three indicators to capture the user reac-
   Subsequently, it is also important to note whether          tion namely, sentiment variance, emotion variance and
a user is influential on the social media platform or          argumentativeness index.
not. In other words, an influential person can be an              We argue that differences of opinion would lead to
opinion leader on a social media platform. For in-             higher debates and discussion on the Twitter platform.
stance, if personalities, such as Barack Obama or Don-         For instance, most social media users would agree with
ald Trump, endorse a particular news article through           a news article such as Global warming would be a se-
Table 3: The table shows 5 news articles along with whether the article was published next day(Pn+1 ), sentiment
variance(SV ) and emotion variance(EV ) of the tweets related to each news article.
           SNo         Title of a few sample news article on the nth day               Pn+1       SV      EV
            1        Suicide bombing at Pakistani hospital kills at least 63            Yes       0.67   43.19
            2         Trump to propose big tax breaks in economic plan                  Yes       0.90   20.30
            3     Trump gives Post columnist a shout-out in economic speech             No        0.00    0.00
            4     Obama commutes sentences for record-breaking 214 prisoners            No        0.00    0.00
            5          Furious GOP leaders plot to get Trump on track                   Yes       0.37   41.51
rious threat in the coming decades and it might create a discussion but not debates. However,
a hypothetical news article such as "President Trump is failing to take appropriate policy
measures to control global warming" would probably lead to a debate between the Democrats and
Republicans. Republicans will try to discard this view, whereas Democrats will try to justify
it. Consequently, the popularity of this particular news article will also go up. We are
attempting to capture this in our proposed model.
   Following prior studies, such as Vader Sentiment Analyzer [HG14] and TextBlob [LKH+ 14], we
have calculated the average sentiment score of a tweet. We have identified all tweets specific
to a particular news article and classified each tweet as positive or negative. Next, we have
considered the sentiment variance of all tweets related to a particular news article to
understand the differences in opinions. We have calculated the sentiment variation (SV) as
follows:

   $$ SV = 1 - \frac{|PC - NC|}{|PC + NC|} $$

   PC is the number of positive tweets for a news article, and NC is the number of negative
tweets for the same news article. The sentiment variance is highest when the counts of positive
and negative tweets are equal for a news article, and it decreases when there are only (or a
higher number of) positive or negative tweets. In other words, having an equal number of
positive and negative tweets indicates that users are from two ideologically opposite camps.
On the contrary, only positive or only negative tweets indicate that users are ideologically
homogeneous.
   Next, we also considered the emotional content of a tweet. We employ the NRC emotion lexicon
[MT10, MT13] to classify a tweet among various emotion classes such as anger, anticipation,
trust, disgust, fear, joy and surprise. Similar to our sentiment variance analysis, we have
considered all tweets specific to a particular news article and classified them into various
categories of emotions. Intuitively, high emotional variance indicates that users are
displaying different emotions towards a news article. We have calculated the emotional variance
(EV) as follows:

   $$ EV = \frac{\sum_{i=1}^{8} \left( e(i) - m(e) \right)^{2}}{n} $$

   In the above formula, n is the number of emotion categories, which is 8 [MT10, MT13], e(i)
is the fraction of tweets with the i-th emotion, and the value of m(e) is N/8, where N is the
total number of tweets related to a particular article. Here, the highest emotion variance
indicates that the tweet corpus for a particular news article represents multiple emotion
categories. For instance, in response to immigration-related news, a Republican, who believes
that strong immigration law would protect American jobs, might display joy. On the contrary, a
social activist, who thinks otherwise, might display her anger to the same news article.

5    Data Analysis

5.1   Preparation of Gold Standard

For our analysis, we need to know whether a particular news article of the nth day is followed
by another subsequent article on the (n + 1)th day. It is important to note that on the
(n + 1)th day the title or the content of the subsequent article can differ significantly from
the previous day. For instance, on the nth day, the hypothetical title of a news article can
be: "Why Brexit matters for the American Corporate Sector?" However, on the (n + 1)th day the
issue will continue, but the title can be: "American Corporates are reluctant to invest in the
UK". So, it requires a contextual understanding to prepare the database for our analysis. Thus,
we employed three annotators and provided them with a particular news article of the nth day
and all the news articles of the (n + 1)th day for manual annotation. We asked our annotators
to mark a news article as 1 if the same news gets covered on the subsequent, i.e. (n + 1)th,
day and as 0 otherwise. For our analysis, we have taken the label chosen by the majority of the
annotators. We have done this for all 300 news articles that we considered for our final
analysis.
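The two variance measures and the majority-vote labelling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names and the emotion label strings are hypothetical. For EV the sketch assumes that e(i) is the number of tweets carrying emotion i, so that m(e) = N/8 is the mean across the eight categories. Returning 0 for an article without polar tweets is likewise an assumption.

```python
from collections import Counter
from typing import Sequence

# The eight NRC emotion categories (label strings are assumed for this sketch).
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")


def sentiment_variation(pc: int, nc: int) -> float:
    """SV = 1 - |PC - NC| / |PC + NC| for positive/negative tweet counts."""
    total = pc + nc
    if total == 0:
        # Assumption: no polar tweets at all is treated as SV = 0.
        return 0.0
    return 1.0 - abs(pc - nc) / abs(total)


def emotional_variance(tweet_emotions: Sequence[str]) -> float:
    """EV = sum_i (e(i) - m(e))^2 / n over the n = 8 emotion categories.

    Assumption: e(i) is the count of tweets labelled with emotion i and
    m(e) = N / 8, where N is the number of tweets for the article.
    """
    counts = Counter(tweet_emotions)
    n = len(EMOTIONS)
    mean = len(tweet_emotions) / n
    return sum((counts.get(e, 0) - mean) ** 2 for e in EMOTIONS) / n


def majority_label(votes: Sequence[int]) -> int:
    """Return 1 if more than half of the annotators voted 1, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0
```

For example, `sentiment_variation(10, 10)` evaluates to 1.0 (evenly split opinions), `sentiment_variation(20, 0)` to 0.0 (one-sided opinions), and `majority_label([1, 0, 1])` to 1.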
Table 4: Baseline models and the features considered in each baseline model

SNo  Baseline Model                      List of Features/Input Variables
1    Article Content + Article Polarity  No. of words in the article; rate of non-stop words; day of
                                         the week on which it was published; published on a weekend
                                         or not; no. of entities in the news article; average word
                                         length of the article
2    Article Polarity                    Polarity score of the article; rate of positive and negative
                                         words per 100 words; rate of positive and negative words per
                                         100 words with non-neutral words; average polarity of
                                         positive and negative words; min. and max. polarity of
                                         positive and negative words
3    Title Content + Title Polarity      No. of words in the title; rate of non-stop words in the
                                         title; no. of entities in the title; average word length of
                                         the title
4    Title Polarity                      Polarity score of the title; rate of positive and negative
                                         words in the title; rate of positive and negative words with
                                         non-neutral words in the title; average polarity of positive
                                         and negative words in the title; min. and max. polarity of
                                         positive and negative words in the title
5    Event Importance                    No. of days on which a news article related to the event was
                                         published; no. of articles published about the event
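As an illustration of the content features listed in Table 4, the following Python sketch computes a few of them for a piece of text. It is only a sketch under stated assumptions: the function name, the tiny stop-word list and the tokenisation rule are hypothetical. Entity counts are omitted because they would need a named-entity recogniser that the table does not specify.

```python
import re
from datetime import date

# Tiny illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "at", "for", "on"}


def content_features(text: str, published: date) -> dict:
    """Compute simple content features for an article body or a title."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    non_stop = [w for w in words if w not in STOP_WORDS]
    return {
        "n_words": n_words,
        "non_stop_rate": len(non_stop) / n_words if n_words else 0.0,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "weekday": published.weekday(),         # Monday = 0 ... Sunday = 6
        "is_weekend": published.weekday() >= 5,  # Saturday or Sunday
    }


# Usage with a headline and an arbitrary, purely illustrative publication date.
features = content_features(
    "Trump to propose big tax breaks in economic plan", date(2016, 8, 6)
)
```

The same function can be applied to the title alone to obtain the title content features.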


                       Table 5: The comparison of the proposed approach with the baselines
                                                                 RFC     SVM      CART
                                    Proposed Approach
                                         Precision                91.4    94.6     83.3
                                           Recall                 84.2    83.3     85.7
                                         F1-Score                 87.6    88.6     84.9
                               F1-Score of Baseline Models
                             Article Content+Article Polarity     86.8    86.9     74.9
                                      Article Polarity            86.8   84.14     81.7
                              Title Content + Title Polarity      86.0    87.8     82.5
                                       Title Polarity             85.3    84.9     83.5
                                     Event Importance             81.5    87.4      84
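Precision, recall and F1-score of the kind reported in Table 5 follow the standard definitions for a binary label (1 if the news is covered again on the next day, 0 otherwise). A minimal sketch of those definitions is given below. The function name is hypothetical, and the sketch does not reproduce the cross-validation or averaging procedure used for the table.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int],
                        predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```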
5.2   Baseline Models

As we discussed in our literature review, a plethora of studies have tried to predict the
popularity of news articles [KMJO16, KFKN15, KYS+ 17, RBdM+ 15, SS16]. This stream of
literature can be broadly classified into five categories: article content and polarity, title
content and polarity, and event importance (as shown in Table 4). We have considered all these
five prediction models as our baseline models.
   Following the prior studies [KFKN15, RBdM+ 15, KVW14], we have extracted both the content
and polarity of the article to predict whether the article will get published on the next day
or not. Here, we employed NLP techniques to understand the overall sentiment of the article,
the usage of positive or negative words within the article, the length of the article, and so
on. Similarly, we have also considered the content and polarity of the title of the article to
predict its popularity. Here, our analysis is restricted only to the title of the article.
Prior studies differ in the number of features they use. In our study, however, we have tried
to consider an exhaustive set of features for the first four baseline models: article content
and polarity, title content and polarity. The final baseline model is the event importance
[SAMA17]. The event importance tries to capture the dominance of a topic/issue in comparison to
others. So, the event importance of a news article is calculated from the number of similar
articles that get published on consecutive days. A pair of news articles is considered similar
if the overlap between their lists of entities and bi-gram features crosses a threshold. Prior
studies noted that event importance is also an indicator for predicting the popularity of a
news article.
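The article-similarity test used for event importance can be sketched as follows. The text does not state the similarity measure or the threshold, so everything specific here is an assumption made for illustration: Jaccard overlap, equal weighting of the bi-gram and entity scores, the threshold value 0.2, and all function names.

```python
from typing import Set, Tuple


def bigrams(text: str) -> Set[Tuple[str, str]]:
    """Set of adjacent lower-cased token pairs in the text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_similar(text_a: str, text_b: str,
               entities_a: Set[str], entities_b: Set[str],
               threshold: float = 0.2) -> bool:
    """Two articles count as similar if their mean overlap reaches the threshold.

    The averaging scheme and the default threshold are hypothetical choices.
    """
    score = (jaccard(bigrams(text_a), bigrams(text_b))
             + jaccard(entities_a, entities_b)) / 2
    return score >= threshold
```

Under these assumptions, the event importance of an article would then be derived by counting how many articles on consecutive days pass `is_similar` against it.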
6    Results and Discussions

We have employed Random Forest Classifier (RFC), Support Vector Machine (SVM), Gradient
Boosting Classifier (GBC), and Classification and Regression Trees (CART) algorithms for our
analysis. We have applied these four classifiers on our article dataset both for the five
baseline models as well as our proposed models. We have considered ten-fold cross-validation
for our analysis. We have repeated our experiments multiple times and found our results are
consistent. We have reported the same in Table 5. Our proposed model has outperformed all five
baseline models for all three classifiers. Our F1-scores for the SVM and RFC classifiers are
marginally better than that of the CART classifier. Broadly, the SVM classifier has
outperformed other classifiers not only for our proposed model but also for baseline models.

7    Conclusions

The advent of information and communication technology has affected the newspaper industry
severely in the last few decades. Seamlessly connected communication channels are generating a
huge volume of information. Moreover, the digital platform is becoming a crowded place.
Multiple news outlets are struggling to grab a larger share of this platform. Therefore,
selecting a potentially popular news article is becoming a daunting task for journalists. This
leads to the requirement of an automated system, which can efficiently select the news article
that will most likely draw the maximum attention of users on the digital platform. To address
this, prior studies focused mostly on the content and polarity of the news article to predict
the popularity of news articles. However, these studies failed to capture the latent
psychological aspects of users. Thus, our proposed approach tries to gauge the users'
perception from social media discussions. We have considered the Twitter platform for our
study. Our proposed model incorporates users' involvement and reaction towards a particular
news article. In short, as our title suggests, we are trying to predict tomorrow's popular
headline by considering today's discussion on the Twitter platform. We have employed various
machine-learning algorithms to test the accuracy of our proposed approach. We observe that our
proposed approach ensures higher accuracy in comparison to other baseline models. Considering
Twitter discussion for predicting the popularity of news articles is the core contribution of
this study. However, there are certain shortcomings of our proposed model which future research
needs to address. Firstly, we have considered a small sample of 300 news articles for a
relatively short period. Future studies in this area should consider a larger sample and a
longer time horizon. Secondly, we have considered the political news of the New York Post. In
other words, we have tested our proposed approach in the political sphere of the United States.
So, future studies need to probe the efficacy of our model for other genres of news in other
countries. The biggest challenge will be to extrapolate this approach to a context where the
native language is not English. Finally, we have considered a few fundamental machine-learning
algorithms. Future studies need to consider advanced deep learning based models to see the
accuracy of our model.

References

[AGHT11]  Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Analyzing user modeling on twitter
    for personalized news recommendations. User Modeling, Adaption and Personalization, pages
    1–12, 2011.

[Cas13]  Carlos Castillo. Traffic prediction and discovery of news via news crowds. In
    Proceedings of the 22nd International Conference on World Wide Web, pages 853–854. ACM,
    2013.

[CBDC17]  Roshni Chakraborty, Maitry Bhavsar, Sourav Dandapat, and Joydeep Chandra. A network
    based stratification approach for summarizing relevant comment tweets of news articles. In
    International Conference on Web Information Systems Engineering, pages 33–48. Springer,
    2017.

[CEHPS14]  Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer, and Matt Stempeck.
    Characterizing the life cycle of online news stories using social media reactions. In
    Proceedings of the 17th ACM conference on Computer supported cooperative work & social
    computing, pages 211–223. ACM, 2014.

[DFMGL12]  Gianmarco De Francisci Morales, Aristides Gionis, and Claudio Lucchese. From chatter
    to headlines: harnessing the real-time web for personalized news recommendation. In
    Proceedings of the fifth ACM international conference on Web search and data mining, pages
    153–162. ACM, 2012.
[FVC15]  Kelwin Fernandes, Pedro Vinagre, and Paulo Cortez. A proactive intelligent decision
    support system for predicting the popularity of online news. In Portuguese Conference on
    Artificial Intelligence, pages 535–546. Springer, 2015.

[HG14]  Clayton J Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment
    analysis of social media text. In Eighth international AAAI conference on weblogs and
    social media, 2014.

[KFKN15]  Sawa Kourogi, Hiroyuki Fujishiro, Akisato Kimura, and Hitoshi Nishikawa. Identifying
    attractive news headlines for social media. In Proceedings of the 24th ACM International on
    Conference on Information and Knowledge Management, pages 1859–1862. ACM, 2015.

[KHGPS16]  Danielle K Kilgo, Summer Harlow, Víctor García-Perdomo, and Ramón Salaverría. A new
    sensation? an international exploration of sensationalism and social media recommendations
    in online news publications. Journalism, page 1464884916683549, 2016.

[KK16a]  Apalak Khatua and Aparup Khatua. Leave or remain? deciphering brexit deliberations on
    twitter. In 2016 IEEE 16th international conference on data mining workshops (ICDMW), pages
    428–433. IEEE, 2016.

[KK16b]  Aparup Khatua and Apalak Khatua. Immediate and long-term effects of 2016 zika
    outbreak: a twitter-based study. In 2016 IEEE 18th International Conference on e-Health
    Networking, Applications and Services (Healthcom), pages 1–6. IEEE, 2016.

[KKGC15]  Aparup Khatua, Apalak Khatua, Kuntal Ghosh, and Nabendu Chaki. Can# twitter trends
    predict election results? evidence from 2014 indian general election. In 2015 48th Hawaii
    international conference on system sciences, pages 1676–1685. IEEE, 2015.

[KMJO16]  Joon Hee Kim, Amin Mantrach, Alejandro Jaimes, and Alice Oh. How to compete online
    for news audience: Modeling words that attract clicks. In Proceedings of the 22nd ACM
    SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1645–1654.
    ACM, 2016.

[KVW14]  Erdal Kuzey, Jilles Vreeken, and Gerhard Weikum. A fresh look on knowledge bases:
    Distilling named events from news. In Proceedings of the 23rd ACM International Conference
    on Conference on Information and Knowledge Management, pages 1689–1698. ACM, 2014.

[KWHR16]  Yaser Keneshloo, Shuguang Wang, Eui-Hong Han, and Naren Ramakrishnan. Predicting the
    popularity of news articles. In Proceedings of the 2016 SIAM International Conference on
    Data Mining, pages 441–449. SIAM, 2016.

[KYS+ 17]  Nagendra Kumar, Anusha Yadandla, K Suryamukhi, Neha Ranabothu, Sravani Boya, and
    Manish Singh. Arousal prediction of news articles in social media. In International
    Conference on Mining Intelligence and Knowledge Exploration, pages 308–319. Springer, 2017.

[LCLS10]  Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit
    approach to personalized news article recommendation. In Proceedings of the 19th
    international conference on World wide web, pages 661–670. ACM, 2010.

[LKH+ 14]  Steven Loria, P Keen, M Honnibal, R Yankovsky, D Karesh, E Dempsey, et al. Textblob:
    simplified text processing. Secondary TextBlob: Simplified Text Processing, 2014.
[LL13]  Lei Li and Tao Li. News recommendation via hypergraph learning: encapsulation of user
    behavior and news content. In Proceedings of the sixth ACM international conference on Web
    search and data mining, pages 305–314. ACM, 2013.

[LWL+ 11]  Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. Scene: a
    scalable two-stage personalized news recommendation system. In Proceedings of the 34th
    international ACM SIGIR conference on Research and development in Information Retrieval,
    pages 125–134. ACM, 2011.

[LWZ+ 17]  Caiyun Liu, Wenjie Wang, Yuqing Zhang, Ying Dong, Fannv He, and Chensi Wu.
    Predicting the popularity of online news based on multivariate analysis. In Computer and
    Information Technology (CIT), 2017 IEEE International Conference on, pages 9–15. IEEE,
    2017.

[LXG+ 14]  Chen Lin, Runquan Xie, Xinjun Guan, Lei Li, and Tao Li. Personalized news
    recommendation via implicit social experts. Information Sciences, 254:1–18, 2014.

[LZZ15]  Qian Liu, Mi Zhou, and Xin Zhao. Understanding news 2.0: A framework for explaining
    the number of comments from readers on online news. Information & Management,
    52(7):764–776, 2015.

[MB16]  Arunav Mishra and Klaus Berberich. Leveraging semantic annotations to link wikipedia
    and news archives. In European Conference on Information Retrieval, pages 30–42. Springer,
    2016.

[MGÁRLGMM13]  Alejandro Montes-García, Jose María Álvarez-Rodríguez, Jose Emilio Labra-Gayo,
    and Marcos Martínez-Merino. Towards a journalist-based news recommendation system: The
    wesomender approach. Expert Systems with Applications, 40(17):6735–6741, 2013.

[MT10]  Saif M Mohammad and Peter D Turney. Emotions evoked by common words and phrases: Using
    mechanical turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 workshop
    on computational approaches to analysis and generation of emotion in text, pages 26–34.
    Association for Computational Linguistics, 2010.

[MT13]  Saif M Mohammad and Peter D Turney. Crowdsourcing a word–emotion association lexicon.
    Computational Intelligence, 29(3):436–465, 2013.

[MTR14]  Nuno Moniz, Luís Torgo, and F Rodrigues. Improvement of news ranking through
    importance prediction. In Proc. KDD Workshop on Data Science for News Publishing (NewsKDD),
    page 6, 2014.

[OCDA15]  Alexandra Olteanu, Carlos Castillo, Nicholas Diakopoulos, and Karl Aberer. Comparing
    events coverage in online news and social media: The case of climate change. In Proceedings
    of the Ninth International AAAI Conference on Web and Social Media, number
    EPFL-CONF-211214, 2015.

[RBdM+ 15]  Julio Reis, Fabricio Benevenuto, P Vaz de Melo, Raquel Prates, Haewoon Kwak, and
    Jisun An. Breaking the news: First impressions matter on online news. In ICWSM15:
    Proceedings of The International Conference on Weblogs and Social Media, 2015.

[SAM+ 16]  R Shreyas, DM Akshata, BS Mahanand, B Shagun, and CM Abhishek. Predicting popularity
    of online articles using random forest regression. In Cognitive Computing and Information
    Processing (CCIP), 2016 Second International Conference on, pages 1–5. IEEE, 2016.
[SAMA17]  Vinay Setty, Abhijit Anand, Arunav Mishra, and Avishek Anand. Modeling event
    importance for ranking daily news events. In Proceedings of the Tenth ACM International
    Conference on Web Search and Data Mining, pages 231–240. ACM, 2017.

[SBZ11]  Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. Personalised rating prediction
    for new users using latent factor models. In Proceedings of the 22nd ACM conference on
    Hypertext and hypermedia, pages 47–56. ACM, 2011.

[SS16]  Pedro Saleiro and Carlos Soares. Learning from the news: Predicting entity popularity
    on twitter. In International Symposium on Intelligent Data Analysis, pages 171–182.
    Springer, 2016.

[TADAF14]  Alexandru Tatar, Panayotis Antoniadis, Marcelo Dias De Amorim, and Serge Fdida. From
    popularity prediction to ranking online news. Social Network Analysis and Mining, 4(1):174,
    2014.

[TLA+ 11]  Alexandru Tatar, Jérémie Leguay, Panayotis Antoniadis, Arnaud Limbourg, Marcelo Dias
    de Amorim, and Serge Fdida. Predicting the popularity of online articles based on user
    comments. In Proceedings of the International Conference on Web Intelligence, Mining and
    Semantics, page 67. ACM, 2011.

[VCLDD17]  Steven Van Canneyt, Philip Leroux, Bart Dhoedt, and Thomas Demeester. Modeling and
    predicting the popularity of online news based on temporal and content-related features.
    Multimedia Tools and Applications, pages 1–28, 2017.

[WLC+ 10]  Jia Wang, Qing Li, Yuanzhu Peter Chen, Jiafen Liu, Chen Zhang, and Zhangxi Lin. News
    recommendation in forum-based social media. In AAAI, 2010.

[XXLZ14]  Zhengyou Xia, Shengwu Xu, Ningzhong Liu, and Zhengkang Zhao. Hot news recommendation
    system from heterogeneous websites based on bayesian model. The Scientific World Journal,
    2014, 2014.

[ZLHL13]  Li Zheng, Lei Li, Wenxing Hong, and Tao Li. Penetrate: Personalized news
    recommendation using ensemble hierarchical clustering. Expert Systems with Applications,
    40(6):2127–2136, 2013.