=Paper=
{{Paper
|id=Vol-2079/paper2
|storemode=property
|title=Visualizing Polarity-based Stances of News Websites
|pdfUrl=https://ceur-ws.org/Vol-2079/paper2.pdf
|volume=Vol-2079
|authors=Masaharu Yoshioka,Myungha Jang,James Allan,Noriko Kando
|dblpUrl=https://dblp.org/rec/conf/ecir/YoshiokaJAK18
}}
==Visualizing Polarity-based Stances of News Websites==
Masaharu Yoshioka, Hokkaido University, Sapporo-shi, Hokkaido, Japan, yoshioka@ist.hokudai.ac.jp

Myungha Jang and James Allan, UMass Amherst, Amherst, MA, USA, {mhjang, allan}@cs.umass.edu

Noriko Kando, National Institute of Informatics (NII), Chiyoda-ku, Tokyo, Japan, kando@nii.ac.jp

Copyright 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

===Abstract===
We develop a novel framework that helps identify potential bias in news websites to support users who are exposed to news articles with a wide variety of political leanings. We propose a polarity-based stance (PS), a vector that represents how often a website publishes articles that are positive or negative with regard to a topic. We derive PS using the GDELT database and visualize the news websites' stances. We demonstrate the utility of our framework via a case study of the 2016 US Presidential Election.

===1 Introduction===
There are two types of users when it comes to their pattern of news navigation. The first type already has particular news websites that they trust and actively use by accessing them directly for news. Such websites tend to demonstrate the same political stances or leanings as their users. As a result, the articles that they read are likely ones that already share their ideologies. The other type, those who are less politically engaged, use a news aggregation website that shows a compiled list of news articles from various sources. A key difference between the two approaches is that the latter type of user, because content is the primary factor in selecting articles, is exposed to news from more diverse sources, which demonstrate a wider array of political stances. Users must therefore use their own judgment to selectively digest what they read, especially for controversial topics.

Many users judge the trustworthiness of news websites based on their political bias. Hence, we propose a novel framework that represents the bias of news websites toward a particular topic as a vector. Using this framework, we then visualize stances of news websites toward a given topic. For this, we define a polarity-based stance, a vector that represents a website's bias toward a particular topic using the polarity of its articles' stances. This allows us to visualize the stance of news websites, alerting users to the potential bias of the articles published by those websites. We demonstrate the usefulness of our framework via a case study of the 2016 US Presidential Election using the GDELT database (https://www.gdeltproject.org).

===2 Polarity-based Stances===
We formally define a polarity-based stance, <math>\overrightarrow{PS}_w</math>, as a two-dimensional vector that denotes the stance of a website <math>w</math>. We first assume that each article of the website has one of three stances: positive, negative, or neutral. We let <math>\overrightarrow{PS}_w = [p, n]</math>, where <math>p</math> is the ratio of positively-stanced articles and <math>n</math> is the ratio of negatively-stanced articles for a particular topic. Note that the stance of each article is assumed to have been identified beforehand. We discuss how to use the GDELT database to derive this vector.

====2.1 Dataset====
The GDELT database is one of the largest news article repositories collected by the Google Jigsaw project. It is a useful resource for multifaceted analysis of news articles because it holds a large amount of data and contains metadata, including the source website, that is automatically extracted from the crawled articles by various NLP algorithms [YK16].

We use tone, one type of automatically generated metadata, to derive <math>\overrightarrow{PS}_w</math>. Tone refers to the average attitude of the article, which is computed as the difference between the percentages of positive and negative terms in the document [Pro15]. Computing a polarity score by term matching is simple, and a more sophisticated methodology would be preferable [RR15]. However, because of the large number of articles to analyze, it is almost impossible for GDELT users to crawl the full text of the articles and compute their own scores. For the case study analysis later, we use articles from the GDELT database published on the 2016 US Presidential Election during a three-month period that includes voting day (see Table 1).

{| class="wikitable"
|+ Table 1: Description of the article dataset used from the GDELT database.
|-
| Period
| Sep 1, 2016 - Nov 30, 2016
|-
| # of Articles
| 22.4M (about 0.2M per day)
|-
| # of News Websites
| 44,624
|}

====2.2 Deriving Polarity-based Stances====
We compute <math>\overrightarrow{PS}_w</math> using the tone score provided by the GDELT database. Let <math>d</math> be a news article published by a news website <math>w</math> and <math>t</math> be the tone of <math>d</math>. We classify the document stance <math>s_d</math> into one of three classes: positive (1), neutral (0), and negative (-1). The stance is derived from <math>t</math> given a threshold <math>\sigma</math> using the equation

<math>s_d = \begin{cases} 1 & t > \sigma \\ 0 & -\sigma < t < \sigma \\ -1 & t < -\sigma \end{cases} \qquad (1)</math>

We then define a polarity-based stance <math>\overrightarrow{PS}_w(\tau)</math> for a website <math>w</math> and a topic <math>\tau</math> using the equation

<math>\overrightarrow{PS}_w(\tau) = \left[ \frac{\sum_{d \in w_\tau} \mathbf{1}[s_d = 1]}{|w_\tau|},\ \frac{\sum_{d \in w_\tau} \mathbf{1}[s_d = -1]}{|w_\tau|} \right] \qquad (2)</math>

where <math>w_\tau</math> is the set of articles on <math>\tau</math> published by <math>w</math>. By plotting these stances on a graph, users can compare the stances of different news websites. In addition, bias can be identified by comparing a website's stances on similar topics, or by comparing its stance on a particular topic with its stance on a general topic.

===3 Case Study===
We demonstrate the utility of our approach via a case study of the 2016 US Presidential Election around two topics: Donald Trump and Hillary Clinton. To visualize the polarity-based stances for these topics, we estimate the set of news articles on each topic using a simple Boolean query. When an article references both Trump and Clinton, it is ambiguous which topic the tone refers to. We therefore compute the polarity-based stances from the articles that reference exactly one of the two topics (see Table 2).

{| class="wikitable"
|+ Table 2: Numbers of articles for the Boolean queries "Donald Trump" (DT) and "Hillary Clinton" (HC). The numbers in parentheses are the total numbers of articles that contain DT and HC, respectively.
|-
! Query
! # of articles
|-
| DT - HC
| 677,307 (1,516,225)
|-
| HC - DT
| 388,162 (1,227,080)
|-
| DT and HC
| 838,918
|}

Table 3 shows the distribution of tone (-100 to 100), as numbers of articles, for the articles retrieved by the DT-HC and HC-DT queries. For both queries, there are more articles with negative tone than with positive tone, but the difference is not very large in general (most articles have tone values between -3 and 3: 69% for DT-HC and 73% for HC-DT). We therefore set <math>\sigma = 1</math> in Equation (1) for this experiment. How <math>\sigma</math> affects the final results should be checked in future research.

{| class="wikitable"
|+ Table 3: Distribution of tone (number of articles) in the retrieved articles
|-
! Tone
! DT-HC
! HC-DT
|-
| [-100, -3]
| 188,709
| 89,283
|-
| (-3, -2]
| 95,665
| 51,781
|-
| (-2, -1]
| 109,575
| 67,006
|-
| (-1, 0]
| 123,231
| 74,878
|-
| (0, 1)
| 65,554
| 42,080
|-
| [1, 2)
| 46,999
| 31,618
|-
| [2, 3)
| 23,528
| 15,510
|-
| [3, 100]
| 24,046
| 16,006
|}

Figures 1 and 2 show scatter plots of the polarity-based stances of various news websites for the Trump and Clinton topics. In these plots, we include news websites that published more than 30 articles on the particular topic. Each circle indicates a news website, with a radius that signifies the number of articles. The top 20 news websites that published the most articles exclusively on Trump and Clinton are indicated by colored circles. Note that a news website with a small number of articles is shown as a point.

[Figure 1: The polarity-based stances of the Trump topic visualized in a scatter plot. Horizontal axis: Positive Article Ratio (Trump), 0 to 1. Vertical axis: Negative Article Ratio (Trump), 0 to 1.]

[Figure 2: The polarity-based stances of the Clinton topic visualized in a scatter plot. Horizontal axis: Positive Article Ratio (Hillary), 0 to 1. Vertical axis: Negative Article Ratio (Hillary), 0 to 1.]

[Figure 3: Diff(Trump) and Diff(Clinton) to compare their polarity-based stances. Horizontal axis: Positive - Negative (Trump), -1 to 1. Vertical axis: Positive - Negative (Hillary), -1 to 1.]

The legend of all three figures lists the same 20 websites: iheart.com, yahoo.com, freerepublic.com, ap.org, reuters.com, newsviewsnreviews.com, wn.com, washingtonpost.com, dailymail.co.uk, alltechnews.org, huffingtonpost.com, avauncer.com, bloomberg.com, washingtonexaminer.com, einnews.com, sfgate.com, foxnews.com, contacto-latino.com, chron.com, princegeorgecitizen.com.

To visualize the bias of the websites (toward Trump or Clinton), we plot the difference between the positive and negative article ratios for Trump and Clinton in Figure 3. We let <math>\mathrm{Diff}(\tau)</math> be the difference between the two components of <math>\overrightarrow{PS}_w(\tau)</math>, that is, the positive ratio minus the negative ratio, so it ranges from -1 to 1 as on the axes of Figure 3. We plot <math>\mathrm{Diff}(\mathit{Trump})</math> against <math>\mathrm{Diff}(\mathit{Clinton})</math> for comparison. Websites whose bias is the same for the two topics lie on the line <math>\mathrm{Diff}(\mathit{Trump}) = \mathrm{Diff}(\mathit{Clinton})</math>. The points at the top left of the plot are websites that are positively-stanced towards Clinton, and the ones at the bottom right are positively-stanced towards Trump. The plot helps us identify news websites whose polarity-based stances are completely different between the two topics. For example, thebostonpilot.com has (0.15, 0.27) for "Trump" and (0.16, 0.65) for "Clinton", and sci-tech-today.com has (0.02, 0.90) for "Trump" and (0.41, 0.25) for "Clinton". It is important to take such bias into account when the difference is this large.

===4 Conclusion===
In this paper, we propose a framework that visualizes the stances of news websites along the dimensions of polarity in order to identify potential bias in the articles they publish. We define a vector named Polarity-based Stance and demonstrate its utility via a case study of the 2016 U.S. Presidential Election, which also shows that the GDELT database is a useful resource for this type of analysis. As future work, we plan to apply our framework to a variety of topics for evaluation. We observe that some topics generally have a higher ratio of positive or negative articles than others. We plan to study how to take this factor into account to visualize stances in a useful way.

===Acknowledgment===
This work was partially supported by JSPS KAKENHI Grant Number 16H01756.

===References===
[Pro15] GDELT Project. The GDELT Global Knowledge Graph (GKG) data format codebook v2.1, 2015.

[RR15] Kumar Ravi and Vadlamani Ravi. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowledge-Based Systems, 89:14-46, 2015.

[YK16] Masaharu Yoshioka and Noriko Kando. Comparative analysis of GDELT data using the news site contrast system. In The First International Workshop on Recent Trends in News Information Retrieval (NewsIR), 2016.
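
As an illustration of Equations (1) and (2) and of the Diff measure used in the case study, the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions: the input is a hypothetical list of (website, tone) pairs already restricted to one topic; tone values exactly equal to the threshold are treated as neutral, since Equation (1) does not assign them; the ratios are taken over a website's articles on the topic; and Diff is computed as positive ratio minus negative ratio, following the "Positive - Negative" axis labels of Figure 3. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def classify_stance(tone, sigma=1.0):
    """Equation (1): map a tone score to 1 (positive), 0 (neutral) or -1 (negative).

    Assumption: tone values exactly at +/- sigma count as neutral.
    """
    if tone > sigma:
        return 1
    if tone < -sigma:
        return -1
    return 0


def polarity_stances(articles, sigma=1.0):
    """Equation (2): return {website: (p, n)} for one topic.

    `articles` is an iterable of (website, tone) pairs on a single topic.
    p and n are the shares of positive and negative articles among the
    website's articles on that topic.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [positive, negative, total]
    for website, tone in articles:
        stance = classify_stance(tone, sigma)
        entry = counts[website]
        entry[0] += stance == 1
        entry[1] += stance == -1
        entry[2] += 1
    return {w: (pos / total, neg / total) for w, (pos, neg, total) in counts.items()}


def diff(stance_vector):
    """Positive ratio minus negative ratio (assumed reading of Diff)."""
    p, n = stance_vector
    return p - n


# Made-up tone values for two hypothetical websites.
articles = [
    ("a.example", 2.5), ("a.example", -0.4), ("a.example", -3.1), ("a.example", 1.2),
    ("b.example", -2.0), ("b.example", -1.5),
]
stances = polarity_stances(articles, sigma=1.0)
for website, vector in sorted(stances.items()):
    print(website, vector, diff(vector))
```

Computing the stance per website in one pass keeps the sketch usable on large article lists; changing `sigma` is the simplest way to explore how the threshold affects the resulting vectors.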