News Recommendation and Filter Bubble

Jianan Yao
Tsinghua University
Beijing, China, 100084
yaojn15@mails.tsinghua.edu.cn

Alexander G. Hauptmann
Carnegie Mellon University
Pittsburgh, PA, USA, 15289
alex@cs.cmu.edu

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Recently, many studies have examined the problem of rumor detection on social media and proposed various automatic detection algorithms. In this ongoing work report we exploit the power of the crowd and formulate the reviewer selection problem, which aims to find reliable reviews for a possible rumor. Our reviewer selection scheme can be considered complementary to existing methods. We give a theoretical analysis and provide a greedy algorithm with an approximation guarantee. We conduct experiments on a Twitter dataset about rumors, which validate the effectiveness and efficiency of our algorithm.

1 Introduction

Nowadays people increasingly rely on the Internet to learn what is happening around the world. Among the tons of stories and pages available online, news recommendation systems provide users with personalized news articles. However, social media and news platforms, seeking to please users, tend to surface information that they guess their users will like, and thereby inadvertently isolate users in their own filter bubbles [Par11]. Rumors and fake news often propagate within filter bubbles, and some argue that they have affected the outcome of the 2016 U.S. presidential election.

Although many articles have investigated the filter bubble problem, little attention has been paid to recommendation systems themselves. In this paper we examine the role of recommendation algorithms in the formation of filter bubbles. In the following two sections we analyze whether and how content-based and collaborative filtering news recommendation algorithms cause the filter bubble problem. Most current recommendation systems use a hybrid model that takes both news content and user feedback into consideration, but studying the two separately gives clearer insight into the problem.

2 Content-based Methods

Most content-based news recommendation algorithms map users and news into the same feature space and calculate their similarity with a certain distance metric. The most common approach is to apply an LDA-based topic model to generate news representations.

We train the basic Latent Dirichlet Allocation (LDA) model [BNJ03] on a public news dataset¹, which includes 142568 articles from 15 media outlets. The number of topics is set to 100. We use the Python topic modeling library gensim² for the implementation. After training we obtain the topic distribution of each news document.

¹ https://www.kaggle.com/snapcrack/all-the-news
² https://radimrehurek.com/gensim/

Fig. 1 shows the topic distribution of news related to President Trump from liberal and conservative websites. To visualize the 100-dimensional data we apply Principal Component Analysis (PCA) for dimensionality reduction. The figure suggests that, in terms of topic distributions, there is only a minor difference between news articles with left-leaning and right-leaning political stances.

Figure 1: LDA-generated representation of Trump-related news from liberal and conservative media outlets.

We further extract news on two unrelated issues, climate change and border wall, and analyze the topic distributions of news from different ideological perspectives. The result is shown in Fig. 2.

Figure 2: LDA-generated news representations on two issues: climate change and border wall. Two opposing sides of the climate change issue are the pro-environment side, which emphasizes the hazard of climate change, and the pro-business side, which claims there have already been excessive regulations. Two opposing sides of the border wall issue are the pro-freedom side, which fears the wall will sow hatred across the country, and the pro-security side, which prioritizes stopping illegal immigration.

In Fig. 2, articles with opposing views on the "climate change" issue (green and yellow dots) occupy the same region in the feature space, and articles with different opinions on "border wall" behave similarly, which indicates that the LDA model cannot distinguish different opinions on the same issue. A user who has read an article about excessive regulation on business is therefore very likely to be recommended other articles about climate change, regardless of their stance. A direct corollary is that purely LDA-based content recommendation systems do not lead to the filter bubble problem.

3 Collaborative Filtering Methods

Collaborative filtering methods have proved successful in many domains, from movie recommendation to shopping recommendation. Typical collaborative filtering algorithms are built on a user-item rating matrix and use two separate feature spaces for users and items. User and item representations can be traditional matrix factorization based vectors [KNK13, Kor08] or neural network based embeddings [WDZE16, HLZ+17].

In this paper, we use the classic LFM (Latent Factor Model) as a representative of collaborative filtering algorithms [KBV09, MS08]. LFM models the preference $\hat{y}_{ui}$ as the dot product of the latent factor vectors $p_u$ and $q_i$, representing the user and the item, respectively:

$$\hat{y}_{ui} = p_u^{\top} q_i$$

We collect data from Twitter using the Twitter standard API³. We select 11 popular U.S. news media outlets with different political leanings, listed in Table 1. Readers can refer to the Wikipedia pages⁴ to learn about the liberal-conservative divide in U.S. politics. For each outlet we retrieve the most recent 200 tweets and query for their retweeters. After removing users with fewer than 3 retweets and news with no retweets, we get a dataset of 7119 users, 2065 news items and 55746 retweets (viewed as positive ratings). Since no negative feedback can be retrieved on Twitter, we randomly choose user-item pairs as negative samples. Finally we use matrix factorization to generate latent factor vectors for users and news. Here we set the number of factors to 5. We also tried other latent dimensions and they show a similar pattern. LFM is implemented in Python using Surprise⁵.

³ https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
⁴ https://en.wikipedia.org/wiki/Social_liberalism, https://en.wikipedia.org/wiki/Social_conservatism
⁵ http://surpriselib.com/

Table 1: List of liberal and conservative news outlets in our dataset.

liberal            conservative
CNN                Fox News
New York Times     Breitbart
The Economist      The Blaze
Politico           National Review
Washington Post    New York Post
                   Rush Limbaugh

Fig. 3 shows the LFM-generated representations of news related to President Trump from liberal and conservative websites. There is an obvious difference between liberal and conservative news representations, especially when compared with Fig. 1.

Figure 3: LFM-generated representation of Trump-related news from liberal and conservative media outlets.

To further investigate the situation, we choose two unrelated categories, international politics and California wildfire, and inspect the news representations from liberal and conservative media. The result is shown in Fig. 4.

Figure 4: LFM-generated news representations on two categories: international politics and California wildfire. Red and blue dots stand for news about California wildfire from conservative and liberal media outlets respectively. Yellow and green dots stand for news on international politics from conservative and liberal media outlets respectively.

Fig. 4 reveals an interesting phenomenon. From the figure it seems that the blue and green dots come from the same category while the red and yellow dots constitute another category, but the opposite is true. The news represented by blue dots and green dots both originate from liberal news outlets, but they tell completely different stories. The same holds for the red and yellow dots. By manual inspection, we find that for California wildfire, nearly all news websites tell exactly the same story and there is no ideological perspective on this event, but news from outlets with different political leanings is still mapped to different regions in the feature space. Articles on California wildfire from liberal media are mapped close to articles on international politics from liberal media. The same is true of conservative media.

Up to now we have found an explanation for filter bubbles. Users tend to read (or, on Twitter, retweet) news whose political leaning is similar to their own, and when an article is published it is first read by like-minded people, which eventually leads to a deadlock. News articles sharing a common audience are mapped to the same region in the feature space, even if the articles are about different topics; users are then recommended what like-minded people tend to read, and the filter bubble is reinforced. In the end we have strong filter bubbles that leave users isolated from different ideological perspectives.

4 Conclusion

In this paper we analyze the role of recommendation systems in filter bubbles. We analyze topic distributions of news under LDA on different issues and from different sources, and find that typical content-based news recommendation algorithms cannot distinguish different opinions on the same topic. We analyze news representations under the Latent Factor Model and show that collaborative filtering algorithms tend to map news from the same outlets or with similar political leaning into contiguous regions, thus leaving users in filter bubbles.

5 Future Work

For content-based methods, we only test the original LDA. In future work we will try topic modeling combined with sentiment analysis and opinion mining.

After figuring out where filter bubbles come from, we should consider how to overcome this problem. We need to strike a balance between breaking the filter bubbles and still letting users enjoy what they see. A reinforcement learning framework that learns users' open-mindedness or tolerance on different topics and adjusts the recommendation policy accordingly could be a reasonable direction.

References

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[HLZ+17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.

[KBV09] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[KNK13] Santosh Kabbur, Xia Ning, and George Karypis. FISM: Factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 659–667. ACM, 2013.

[Kor08] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434. ACM, 2008.

[MS08] Andriy Mnih and Ruslan R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.

[Par11] Eli Pariser. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. Penguin, 2011.

[WDZE16] Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 153–162. ACM, 2016.
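The mechanism described in Section 3 can be illustrated with a minimal sketch. It is not the authors' code: the paper uses Surprise on Twitter data, whereas this sketch uses a hand-written SGD matrix factorization on a synthetic toy set. The names `train_lfm` and `build_toy_data`, all hyperparameters, and the assumption that every user retweets only outlets of their own leaning are hypothetical. Negative samples are drawn at random from unobserved user-item pairs.

```python
import math
import random


def train_lfm(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=200, seed=0):
    """Fit y_hat[u][i] = p_u . q_i by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    p = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - sum(a * b for a, b in zip(p[u], q[i]))
            for f in range(k):
                pu, qi = p[u][f], q[i][f]
                # Squared-error gradient step with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def build_toy_data(seed=0):
    """Hypothetical toy set: 20 users, 12 items, two leanings, two topics."""
    rng = random.Random(seed)
    n_users, n_items = 20, 12
    user_leaning = [u % 2 for u in range(n_users)]
    item_leaning = [i % 2 for i in range(n_items)]
    item_topic = [(i // 2) % 2 for i in range(n_items)]
    pairs = [(u, i) for u in range(n_users) for i in range(n_items)]
    # Assumption: a user retweets every item from outlets of their leaning.
    positives = [(u, i) for u, i in pairs
                 if user_leaning[u] == item_leaning[i]]
    unobserved = [(u, i) for u, i in pairs
                  if user_leaning[u] != item_leaning[i]]
    # Random negative sampling from the unobserved pairs.
    negatives = rng.sample(unobserved, len(unobserved) // 2)
    ratings = ([(u, i, 1.0) for u, i in positives]
               + [(u, i, 0.0) for u, i in negatives])
    return ratings, n_users, n_items, item_leaning, item_topic


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pair_similarity(vectors, keep):
    """Average cosine similarity over item pairs (i, j) selected by keep."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if keep(i, j)]
    return sum(sims) / len(sims)


ratings, n_users, n_items, leaning, topic = build_toy_data()
user_vectors, item_vectors = train_lfm(ratings, n_users, n_items)

# Same audience, different topic.
same_leaning = mean_pair_similarity(
    item_vectors,
    lambda i, j: leaning[i] == leaning[j] and topic[i] != topic[j])
# Same topic, different audience.
same_topic = mean_pair_similarity(
    item_vectors,
    lambda i, j: topic[i] == topic[j] and leaning[i] != leaning[j])

print("same leaning, different topic:", round(same_leaning, 3))
print("same topic, different leaning:", round(same_topic, 3))
```

Under these assumptions, items that share an audience should end up with nearly parallel latent vectors even when their topics differ, while items on the same topic from opposite-leaning outlets should not. The item topic is never seen by the model, which is the point of the comparison with Fig. 4.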