News Recommendation and Filter Bubble

                              Jianan Yao                                   Alexander G. Hauptmann
                         Tsinghua University                              Carnegie Mellon University
                        Beijing, China, 100084                            Pittsburgh, PA, USA, 15289
                    yaojn15@mails.tsinghua.edu.cn                               alex@cs.cmu.edu


                                Abstract

Recently, many studies have examined the problem of rumor detection on
social media and proposed various automatic detection algorithms. In
this ongoing work report we exploit the power of the crowd and formulate
the reviewer selection problem, which aims to find reliable reviews for
a possible rumor. Our reviewer selection scheme can be considered
complementary to existing methods. We give a theoretical analysis and
provide a greedy algorithm with an approximation guarantee. We conduct
experiments on a Twitter dataset about rumors, which validate the
effectiveness and efficiency of our algorithm.

1    Introduction

Nowadays people increasingly rely on the Internet to learn what is
happening around the world. Among the vast number of stories and pages
available online, news recommendation systems provide users with
personalized news articles. However, social media and news platforms,
seeking to please users, can steer them toward information they are
expected to like, and in doing so inadvertently isolate them in their
own filter bubbles [Par11]. Rumors and fake news often propagate within
filter bubbles, and some argue that they affected the outcome of the
2016 U.S. presidential election.

Although many articles have investigated the filter bubble problem,
little attention has been paid to recommendation systems themselves. In
this paper we examine the role of recommendation algorithms in the
formation of filter bubbles. In the following two sections we analyze
whether and how content-based and collaborative filtering news
recommendation algorithms cause the filter bubble problem. Most current
recommendation systems use a hybrid model that takes both news content
and user feedback into consideration, but studying the two components
separately gives clearer insight into the problem.

Copyright © CIKM 2018 for the individual papers by the papers' authors.
Copyright © CIKM 2018 for the volume as a collection by its editors.
This volume and its papers are published under the Creative Commons
License Attribution 4.0 International (CC BY 4.0).

2    Content-based Methods

Most content-based news recommendation algorithms map users and news
into the same feature space and calculate their similarity with a
distance metric. The most common approach is to apply an LDA-based topic
model to generate news representations.

We train the basic Latent Dirichlet Allocation (LDA) model [BNJ03] on a
public news dataset¹, which includes 142568 articles from 15 media
outlets. The number of topics is set to 100. We use the Python topic
modeling library gensim² for implementation. After training we obtain
the topic distribution of each news document.

¹ https://www.kaggle.com/snapcrack/all-the-news
² https://radimrehurek.com/gensim/

Fig. 1 shows the topic distribution of news related to President Trump
from liberal and conservative websites. To visualize the 100-D data we
apply Principal Component Analysis (PCA) for dimensionality reduction.
The figure shows that, in terms of topic distributions, there is only a
minor difference between news articles with left-leaning and
right-leaning political stances.

Figure 1: LDA-generated representation of Trump-related news from
liberal and conservative media outlets.

We further extract news on two unrelated issues, climate change and
border wall, and analyze the topic distributions of news from different
ideological perspectives. The result is shown in Fig. 2.

Figure 2: LDA-generated news representations on two issues: climate
change and border wall. The two opposing sides of the climate change
issue are the pro-environment side, which emphasizes the hazard of
climate change, and the pro-business side, which claims there have
already been excessive regulations. The two opposing sides of the border
wall issue are the pro-freedom side, which fears the wall will sow
hatred across the country, and the pro-security side, which prioritizes
stopping illegal immigration.

In Fig. 2, articles with opposing views on the "climate change" issue
(green and yellow dots) occupy the same region in the feature space, and
articles with different opinions on "border wall" behave similarly. This
indicates that the LDA model cannot distinguish different opinions on
the same issue. A user who has read an article about excessive
regulation on business is very likely to be recommended other articles
about climate change, whatever their stance. A direct corollary is that
pure LDA content-based recommendation systems do not lead to the filter
bubble problem.
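The similarity-based matching described in this section can be sketched as follows. This is a minimal illustration, not the pipeline used for the figures: the three-topic vectors, the article names, and the choice of cosine similarity are all assumptions made for the example (the experiments above use 100 LDA topics and do not name a specific metric).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def user_profile(read_vectors):
    """Represent a user as the mean topic distribution of articles read."""
    n = len(read_vectors)
    return [sum(column) / n for column in zip(*read_vectors)]

# Hypothetical topic distributions over (climate, immigration, sports).
# Opposing stances on climate change get almost the same topic mixture,
# mirroring the overlap observed in Fig. 2.
read_by_user = [[0.75, 0.10, 0.15]]  # pro-business climate article
candidates = {
    "climate_pro_environment": [0.80, 0.05, 0.15],
    "border_wall_pro_security": [0.05, 0.85, 0.10],
    "sports_report": [0.05, 0.05, 0.90],
}

profile = user_profile(read_by_user)
scores = {name: cosine(profile, vec) for name, vec in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

Under these assumed vectors the pro-environment article ranks first for a reader of the pro-business article, since stance is not encoded in the topic mixture.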
3    Collaborative Filtering Methods

Collaborative filtering methods have proved successful in many domains,
from movie recommendation to shopping recommendation. Typical
collaborative filtering algorithms are built on a user-item-rating
matrix and use two separate feature spaces for users and items. User and
item representations can be traditional matrix factorization based
vectors [KNK13, Kor08] or neural network based embeddings
[WDZE16, HLZ+ 17].

In this paper, we use the classic LFM (Latent Factor Model) as a
representative of collaborative filtering algorithms [KBV09, MS08]. LFM
models the preference $\hat{y}_{ui}$ as the dot product of the latent
factor vectors $p_u$ and $q_i$, representing the user and the item,
respectively.

$$\hat{y}_{ui} = p_u^{\top} q_i$$

We collect data from Twitter using the Twitter standard API³. We select
11 popular U.S. news media with different political leanings, listed in
Table 1. The Wikipedia pages⁴ on social liberalism and social
conservatism give background on the liberal-conservative divide in U.S.
politics. For each outlet we retrieve the most recent 200 tweets and
query for their retweeters. After removing users with fewer than 3
retweets and news with no retweets, we get a dataset of 7119 users, 2065
news items and 55746 retweets (viewed as positive ratings). Since no
negative feedback can be retrieved on Twitter, we randomly choose
user-item pairs as negative samples. Finally we use matrix factorization
to generate latent factor vectors for users and news. Here we set the
number of factors to 5. We also tried other numbers of latent dimensions
and they show a similar pattern. LFM is implemented in Python using
Surprise⁵.

³ https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
⁴ https://en.wikipedia.org/wiki/Social_liberalism,
  https://en.wikipedia.org/wiki/Social_conservatism
⁵ http://surpriselib.com/

Table 1: List of liberal and conservative news outlets in our dataset.

    liberal              conservative
    CNN                  Fox News
    New York Times       Breitbart
    The Economist        The Blaze
    Politico             National Review
    Washington Post      New York Post
                         Rush Limbaugh

Fig. 3 shows the latent factor representations of news related to
President Trump from liberal and conservative websites. There is an
obvious difference between liberal and conservative news
representations, especially when compared with Fig. 1.
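The latent factor model and the negative sampling step described in this section can be sketched in plain Python as follows. This is an illustrative toy and not the Surprise implementation used in the paper: the eight users, six news items, learning rate, regularization weight, squared-error loss, and the restriction of negative samples to unobserved pairs are all assumptions made for the example. Only the dot-product prediction and the 5 latent factors come from the text above.

```python
import random

def dot(p, q):
    """Predicted preference: dot product of user and item factors."""
    return sum(a * b for a, b in zip(p, q))

def train_lfm(samples, n_users, n_items, n_factors=5,
              lr=0.05, reg=0.02, epochs=300, seed=0):
    """Fit user factors P and item factors Q by SGD on squared error."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, y in samples:
            err = y - dot(P[u], Q[i])
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Hypothetical toy data: users 0-3 retweet only items 0-2 (one outlet
# group), users 4-7 retweet only items 3-5 (the other group).
n_users, n_items = 8, 6

def same_side(u, i):
    return (u < 4) == (i < 3)

positives = [(u, i) for u in range(n_users) for i in range(n_items)
             if same_side(u, i)]
unobserved = [(u, i) for u in range(n_users) for i in range(n_items)
              if not same_side(u, i)]

# Retweets are positive ratings; negatives are randomly drawn pairs
# (here, as an assumption, drawn only from pairs with no retweet).
negatives = random.Random(1).sample(unobserved, 16)
samples = ([(u, i, 1.0) for u, i in positives]
           + [(u, i, 0.0) for u, i in negatives])

P, Q = train_lfm(samples, n_users, n_items)

same = [dot(P[u], Q[i]) for u, i in positives]
cross = [dot(P[u], Q[i]) for u, i in unobserved]
mean_same = sum(same) / len(same)
mean_cross = sum(cross) / len(cross)
print(f"mean same-group score:  {mean_same:.2f}")
print(f"mean cross-group score: {mean_cross:.2f}")
```

In this toy setting the item vectors carry no information about article content at all; whatever separation appears between the two groups comes purely from who retweeted what.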


Figure 3: LFM-generated representation of Trump-related news from
liberal and conservative media outlets.

To further investigate the situation, we choose two unrelated
categories, international politics and California wildfire, and inspect
the news representations from liberal and conservative media. The result
is shown in Fig. 4.

Figure 4: LFM-generated news representations on two categories:
international politics and California wildfire. Red and blue dots stand
for news about California wildfire from conservative and liberal media
outlets respectively. Yellow and green dots stand for news on
international politics from conservative and liberal media outlets
respectively.

Fig. 4 reveals an interesting phenomenon. From the figure it seems that
blue and green come from the same category while red and yellow
constitute another category, but the opposite is true. The news
represented by blue dots and green dots both originate from liberal news
outlets, but they tell completely different stories. The same holds for
the red and yellow dots. By manual inspection, we find that for
California wildfire, nearly all news websites tell exactly the same
story and there is no ideological perspective on this event, yet news
from outlets with different political leanings is still mapped to
different regions in the feature space. Articles on California wildfire
from liberal media are mapped close to articles on international
politics from liberal media. The same holds for conservative media.

This gives us an explanation for filter bubbles. Users tend to read (or,
on Twitter, retweet) news whose political leaning is similar to their
own, and when an article is published it will first be read by
like-minded people, which eventually leads to a deadlock. News articles
sharing a common audience are mapped to the same region in the feature
space, even if the articles are about different topics. Users are then
recommended what like-minded people tend to read, and the filter bubble
is reinforced. The end result is strong filter bubbles that leave users
isolated from different ideological perspectives.

4    Conclusion

In this paper we analyze the role of recommendation systems in filter
bubbles. We analyze topic distributions of news under LDA on different
issues and from different sources, and find that typical content-based
news recommendation algorithms cannot distinguish different opinions on
the same topic. We analyze news representations under the Latent Factor
Model and show that collaborative filtering algorithms tend to map news
from the same outlets or with similar political leaning into contiguous
regions, thus leaving users in filter bubbles.

5    Future Work

For content-based methods, we only test the original LDA. In future work
we will try topic modeling combined with sentiment analysis and opinion
mining.

Having identified where filter bubbles come from, we should consider how
to overcome the problem. We need to strike a balance between breaking
the filter bubbles and still letting users enjoy what they see. A
reinforcement learning framework, which learns users' open-mindedness or
tolerance on different topics and adjusts the recommendation policy
accordingly, could be a reasonable direction.

References

[BNJ03]    David M Blei, Andrew Y Ng, and Michael I Jordan. Latent
           dirichlet allocation. Journal of Machine Learning Research,
           3(Jan):993–1022, 2003.


[HLZ+ 17] Xiangnan He, Lizi Liao, Hanwang Zhang,
          Liqiang Nie, Xia Hu, and Tat-Seng Chua.
          Neural collaborative filtering. In Proceed-
          ings of the 26th International Conference
          on World Wide Web, pages 173–182. In-
          ternational World Wide Web Conferences
          Steering Committee, 2017.
[KBV09]    Yehuda Koren, Robert Bell, and Chris
           Volinsky. Matrix factorization techniques
           for recommender systems.       Computer,
           (8):30–37, 2009.
[KNK13]    Santosh Kabbur, Xia Ning, and George
           Karypis. Fism: factored item similarity
           models for top-n recommender systems. In
           Proceedings of the 19th ACM SIGKDD in-
           ternational conference on Knowledge dis-
           covery and data mining, pages 659–667.
           ACM, 2013.

[Kor08]    Yehuda Koren. Factorization meets the
           neighborhood: a multifaceted collabora-
           tive filtering model. In Proceedings of
           the 14th ACM SIGKDD international con-
           ference on Knowledge discovery and data
           mining, pages 426–434. ACM, 2008.

[MS08]     Andriy Mnih and Ruslan R Salakhutdi-
           nov. Probabilistic matrix factorization. In
           Advances in neural information processing
           systems, pages 1257–1264, 2008.

[Par11]    Eli Pariser. The filter bubble: How the new
           personalized web is changing what we read
           and how we think. Penguin, 2011.
[WDZE16] Yao Wu, Christopher DuBois, Alice X
         Zheng, and Martin Ester. Collaborative
         denoising auto-encoders for top-n recom-
         mender systems. In Proceedings of the
         Ninth ACM International Conference on
         Web Search and Data Mining, pages 153–
         162. ACM, 2016.

