Credibility and Transparency of News Sources: Data Collection and Feature Analysis

Ahmet Aker
University of Duisburg-Essen, Duisburg, Germany and University of Sheffield, Sheffield, England
aker@is.inf.uni-due.de

Vincentius Kevin
University of Duisburg-Essen, Duisburg, Germany
vincentius.kevin@stud.uni-due.de

Kalina Bontcheva
University of Sheffield, Sheffield, England
k.bontcheva@sheffield.ac.uk

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

Abstract

The ability to discern news sources based on their credibility and transparency is useful for users in making decisions about news consumption. In this paper, we release a dataset of 673 sources with manually assigned credibility and transparency scores. Upon acceptance we will make this dataset publicly available. Furthermore, we compare features which can be computed automatically and measure their correlation with the credibility and transparency scores annotated by human experts. Our correlation analysis shows that there are indeed features which correlate highly with the manual judgments.

1 Introduction

The Web has never been as big as it is now. It contains a tremendous amount of information represented in the form of articles, videos, images, blog and social media posts, and many other entries. One of the reasons for this massive growth is that the Web is no longer shaped only by a few experts, professionals, or institutions, but by everyone who has access. Although this new style of contribution to web content has led to immense information richness and diverse views, it has also brought new challenges. It has stripped traditional information providers, such as news media, of their gate-keeping role [1] and has left the public in a jungle of web content of varying quality, ranging from reliable, true information to misinformation, i.e. claims that are not true.

Misinformation is often used interchangeably with the term fake news. Douglas et al. refer to fake news as the "deliberate publication of fictitious information, hoaxes and propaganda" [7], and it is defined similarly by others [11]. Furthermore, it has been reported that the veracity of information is highly connected to the publisher, i.e. the source of the information [6, 4]. Thus, instead of judging individual articles, as done in [12, 8, 14], there are services that assess the sources publishing online news. NewsGuard (www.newsguardtech.com) is one such service. NewsGuard manually analyses each news-publishing source in terms of credibility and transparency and provides detailed information such as references and reasoning, as well as the persons accountable for each analysis. The results are made available to the public via a browser plugin.

In this paper we use NewsGuard to manually collect analysis results for 673 news sources. For each news source we record not only the overall credibility and transparency scores but also the detailed information that led to those overall decisions. We plan to make this dataset freely available (https://github.com/ahmetaker/sourceCredibility). Next, we collect a rich set of well-known metrics/features used by, e.g., search engines to assess the popularity of a website and run a correlation analysis between these features and the manually assigned NewsGuard scores. Our analysis shows that there are features which correlate highly with the NewsGuard scores. This suggests that the manual process performed by NewsGuard could be automated.
2 Data Collection

2.1 NewsGuard: Credibility and Transparency Scores

NewsGuard's team has manually reviewed thousands of news agencies, mostly based in the US, and labelled them against nine criteria. A news agency is awarded credibility and transparency points for each criterion it fulfils. The criteria are listed below.

Credibility criteria:

• Does not repeatedly publish false content (22 points)
• Gathers and presents information responsibly (18 points)
• Regularly corrects or clarifies errors (12.5 points)
• Handles the difference between news and opinion responsibly (12.5 points)
• Avoids deceptive headlines (10 points)

Transparency criteria:

• Website discloses ownership and financing (7.5 points)
• Clearly labels advertising (7.5 points)
• Reveals who's in charge (5 points)
• The site provides the names of content creators, along with either contact information or biographical information (5 points)

The credibility and transparency points sum to 100 at maximum, and a news website is considered "safe" if it has at least 60 points.
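To make the weighting concrete, the following is a minimal sketch of how the nine criterion weights above combine into credibility, transparency and overall scores, and how the 60-point "safe" threshold applies. The criterion keys, function name and example are our own illustration, not NewsGuard's implementation.

```python
# Illustrative sketch of NewsGuard-style scoring; only the weights and the
# 60-point "safe" threshold are taken from the criteria listed above.

CREDIBILITY_WEIGHTS = {
    "no_false_content": 22.0,
    "responsible_reporting": 18.0,
    "corrects_errors": 12.5,
    "news_vs_opinion": 12.5,
    "no_deceptive_headlines": 10.0,
}  # sums to 75

TRANSPARENCY_WEIGHTS = {
    "discloses_ownership": 7.5,
    "labels_advertising": 7.5,
    "reveals_whos_in_charge": 5.0,
    "names_content_creators": 5.0,
}  # sums to 25


def score_source(fulfilled_criteria: set) -> dict:
    """Sum the weights of the fulfilled criteria and report the breakdown."""
    credibility = sum(w for c, w in CREDIBILITY_WEIGHTS.items() if c in fulfilled_criteria)
    transparency = sum(w for c, w in TRANSPARENCY_WEIGHTS.items() if c in fulfilled_criteria)
    total = credibility + transparency
    return {"credibility": credibility, "transparency": transparency,
            "total": total, "safe": total >= 60}


# Example: a source that fulfils everything except error correction and ad labelling.
all_criteria = set(CREDIBILITY_WEIGHTS) | set(TRANSPARENCY_WEIGHTS)
print(score_source(all_criteria - {"corrects_errors", "labels_advertising"}))
# {'credibility': 62.5, 'transparency': 17.5, 'total': 80.0, 'safe': True}
```

A source fulfilling every criterion scores 100 (75 credibility + 25 transparency); the example above drops error correction and advertising labelling and still stays above the 60-point "safe" threshold.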
2.2 News Sources

The list of news sources we used was taken from Media Bias Fact Check (MBFC). MBFC aims to categorize sources by political bias. The categories are as follows, with some descriptions (partially) quoted from their website (mediabiasfactcheck.com).

• Left/Right: "moderately to strongly biased toward" liberal/conservative causes, may be untrustworthy.
• Left-Center/Right-Center: slight to moderate bias toward liberal/conservative causes.
• Center (Least Biased): minimal bias, most credible media sources.
• Pro-Science: "These sources consist of legitimate science or are evidence based through the use of credible scientific sourcing. ..."
• Conspiracy-Pseudoscience: "Sources in the Conspiracy-Pseudoscience category may publish unverifiable information that is not always supported by evidence. ..."
• Questionable Sources: "extreme bias, consistent promotion of propaganda/conspiracies, poor or no sourcing to credible information, a complete lack of transparency and/or is fake news."
• Satire: "... humor, irony, exaggeration, or ridicule to expose and criticize people's stupidity or vices, ... these sources are clear that they are satire and do not attempt to deceive"
• Re-Evaluated Sources: sources which have been updated by MBFC. They are duplicates, so this category is removed from our analysis.

We used the sources from MBFC (2714 in total) as input to NewsGuard (see the next section).

2.3 Collection Procedure

To collect NewsGuard judgments on the sources taken from MBFC we performed a manual process. We installed the NewsGuard browser plugin and visited each MBFC source. The results shown by the plugin were recorded. For instance, for BBC.com, NewsGuard lists the results shown in Figure 1. For this source we recorded the values of the individual labels as well as the overall NewsGuard score (in this case 95). If the results were unavailable because NewsGuard had not analysed the source, the news source was discarded.

[Figure 1: NewsGuard on bbc.com]

We performed this procedure for all 2714 news sources available in the nine categories at the time. NewsGuard scores were available for only 673 of them. Most of the sources in the "Satire" category were unavailable.

The scores were found to agree with MBFC's description of each category: in general, least biased and pro-science sources are the most credible ones, while extremely biased and conspiracy/pseudoscience sources can be unreliable. Table 1 shows the average score and standard deviation per category. The counts show how many sources are available on NewsGuard out of all that were listed in MBFC.

category      | count     | µ(score) | σ(score) | cred. | tran.
Left          | 85 / 316  | 77.16    | 22.25    | 57.81 | 19.35
Left Center   | 185 / 466 | 94.32    | 8.11     | 72.58 | 21.74
Center        | 122 / 404 | 94.29    | 8.29     | 72.20 | 22.09
Right Center  | 76 / 224  | 92.01    | 15.00    | 70.03 | 21.97
Right         | 60 / 269  | 61.27    | 26.82    | 46.02 | 15.25
Pro Science   | 27 / 139  | 93.89    | 7.51     | 72.22 | 21.67
Conspiracy    | 39 / 287  | 30.09    | 27.76    | 16.88 | 13.21
Fake News     | 76 / 478  | 23.55    | 17.33    | 12.93 | 10.46
Satire*       | 3 / 131   | 5.00     | 4.33     | 0.00  | 5.00

Table 1: NewsGuard score per source category and the breakdown into credibility (max. 75) and transparency (max. 25). The count shows how many news sources are available in NewsGuard out of all sources listed in MBFC. *The satire category is not representative as it has only 3 NewsGuard scores.
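Assuming one record per analysed source, holding its MBFC category, overall score and the credibility/transparency breakdown, the per-category statistics in Table 1 can be reproduced with a simple aggregation. The column names and the placeholder rows in the sketch below are ours, not part of the released data.

```python
import pandas as pd

# Placeholder rows standing in for the recorded judgments: one row per source
# that NewsGuard had analysed, with the MBFC category it was listed under.
df = pd.DataFrame({
    "source":       ["example-a.com", "example-b.com", "example-c.com", "example-d.com"],
    "category":     ["Center", "Center", "Right", "Conspiracy"],
    "score":        [95.0, 92.5, 60.0, 20.0],
    "credibility":  [75.0, 70.0, 47.5, 12.5],
    "transparency": [20.0, 22.5, 12.5, 7.5],
})

# Per-category count, mean and standard deviation of the overall score, plus the
# average credibility and transparency, i.e. the quantities reported in Table 1.
table1 = (
    df.groupby("category")
      .agg(count=("score", "size"),
           mean_score=("score", "mean"),
           std_score=("score", "std"),
           credibility=("credibility", "mean"),
           transparency=("transparency", "mean"))
      .round(2)
)
print(table1)
```

Under this layout, running the aggregation over the full set of 673 judged sources yields the rows of Table 1.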
3 Correlation Analysis

In the correlation analysis the automatic features are compared to the manually annotated credibility and transparency scores in order to analyze the correlation and predictive power of the features. Specifically, we calculated the correlation between each automatic feature and the combined score (3 × credibility + transparency) derived from NewsGuard (https://www.newsguardtech.com/ratings/rating-process-criteria/). In the following we outline the features we selected as well as the metrics used to perform the correlation analysis.

3.1 Automatic Features

3.1.1 CheckPageRank

CheckPageRank (checkpagerank.net), abbreviated cPR, provides a free online tool which reports a page rank score, the Alexa rank, and a few other domain analysis results for any given website. The tool does not provide any exact definition of, or information on, how the scores are calculated. However, cPR provides scores which appear to be taken from non-free services such as the Moz SEO and Majestic SEO tools. While those tools limit free usage to ten queries per month and a few queries per day, respectively (as of 2019), cPR allows one query every thirty seconds, although it does not provide the full information available in the other tools.

Below is the most likely explanation we found for each feature provided by cPR, either because the feature name is self-explanatory or because the presumed underlying services give exact or very close scores compared to what is displayed by cPR.

• Google Page Rank: a score from 0 to 10 which estimates the importance of the website based on the quantity and quality of links to it from other websites.

• cPR Score: this is shown visually as one of the most important scores on checkpagerank.net, albeit without any given definition. We presume that 'cPR' simply stands for 'checkPageRank' and that the cPR score is calculated with a proprietary formula or algorithm.

• Citation Flow and Trust Flow: these two scores most probably come from Majestic (majestic.com), an SEO (Search Engine Optimization) tool. According to Majestic's glossary (https://majestic.com/help/glossary), citation flow focuses on the quantity and influential power of links to the website, while trust flow focuses on links from manually reviewed trusted sites. Majestic appears to have crawled over 600 billion URLs by 2014 [13].

• Topic Value: this score also most likely comes from Majestic. Majestic provides a "Topical Trust Flow" score which, according to their glossary, "shows the relative influence [...] in any given topic or category." A likely explanation is that cPR shows only the topic for which the website has the best Topical Trust Flow, since the topic names and value range are exactly the same in cPR and Majestic.

• Backlinks: external backlinks are links from other websites to the subject website. This excludes internal links, which usually exist to let users navigate within the same website.

• Referring domains: the number of domains which contain backlink(s) to the subject website.

• EDU and GOV backlinks and domains: Majestic also provides the counts of educational and governmental backlinks and domains.

• Domain Authority and Page Authority: the Moz (moz.com) SEO tool describes these scores as "the ranking potential in search engines based on an algorithmic combination of all link metrics". While MozRank is not used directly by search engines, it is similar and correlated to the rankings of major search engines [16]. We tested a few websites and confirmed that cPR shows exactly the same scores as Moz.

• Spam Score: this most likely represents the Moz SEO spam flags explained on their website (https://moz.com/blog/spam-score-mozs-new-metric-to-measure-penalization-risk). The flags represent internal and external features of websites that are indicative of 'spam websites' and that have been found to be penalized or banned by Google.

• Alexa Rank: Alexa Rank is described as a popularity measure which "is calculated using a proprietary methodology that combines a site's estimated traffic and visitor engagement over the past three months" (blog.alexa.com).

• Alexa Reach Rank: this score is based specifically on the estimated number of people each website is able to reach.

• Indexed URLs: this may be the number of URLs indexed by Google, as is commonly provided in SEO tools, but since no information is provided, this is only a guess.

3.1.2 Twitter

• Number of followers: the number of users on twitter.com who "subscribe" to the news source's Twitter account. Posts made on Twitter will appear on the followers' home screens.

• Listed count: a Twitter user can make lists of users to personally categorize other users. They can keep a list private or publicly visible. The listed count represents the number of public lists in which the Twitter user appears.

3.1.3 Facebook

• Page Likes: the number of Facebook users who like the Facebook page of the news source, by simply clicking on the like button. Like information is publicly available.

• Page Followers: the number of Facebook users who follow the page, which means any posts by the page will be shown in the users' home feeds. By default, when someone likes a page, they automatically follow the page as well. The user can then "unfollow" while still keeping the "like". It is also possible to follow a page without liking it.
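To make the feature set concrete, the sketch below shows one possible per-source record covering a subset of the features above, together with the combined target score (3 × credibility + transparency) introduced at the start of this section. The field names and example values are illustrative placeholders, not part of the released data.

```python
from dataclasses import dataclass


@dataclass
class SourceFeatures:
    """A subset of the automatically collected features for one news source
    (field names are our own shorthand, values below are placeholders)."""
    ext_backlinks: int        # cPR / Majestic link counts
    ref_domains: int
    gov_backlinks: int
    edu_backlinks: int
    trust_flow: float
    citation_flow: float
    domain_authority: float
    alexa_rank: int
    twitter_followers: int    # Twitter
    twitter_listed: int
    facebook_likes: int       # Facebook
    facebook_follows: int


def combined_target(credibility: float, transparency: float) -> float:
    """Combined NewsGuard-based score used as the correlation target."""
    return 3 * credibility + transparency


example = SourceFeatures(ext_backlinks=1_200_000, ref_domains=45_000,
                         gov_backlinks=3_000, edu_backlinks=8_000,
                         trust_flow=62.0, citation_flow=55.0,
                         domain_authority=78.0, alexa_rank=1_500,
                         twitter_followers=250_000, twitter_listed=4_200,
                         facebook_likes=300_000, facebook_follows=310_000)
print(combined_target(credibility=70.0, transparency=22.5))  # -> 232.5
```

In the analysis, each of these fields is correlated individually against the combined target.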
3.2 Pearson Correlation with Logarithmic Transformation

First, we measured the Pearson correlation [3]. Pearson only measures linear relationships; if there is no such relationship, Pearson is not a good choice for computing the correlation. One way of overcoming this limitation is to convert the data to a logarithmic scale. Therefore, we also applied a logarithm (base 10) to the features before calculating the Pearson correlation (adding one to avoid the undefined logarithm of zero), in order to capture correlations which follow a power law rather than a linear relationship.

We expected features such as backlink counts and the number of likes in social media to follow a power law, under the assumption that website links and user networks in social media follow the pattern of a scale-free network (preferential attachment) [2].

We also expected the behavior of ranking features (e.g. Alexa Rank) to be non-linear. Although it is not necessarily logarithmic, the ratio between ranks is a better measure than the rank difference. By applying a logarithm kernel, only the ratio is considered, i.e. the difference between ranks 10 and 20 is treated as being as significant as the difference between ranks 1,000 and 2,000.

3.3 Spearman and Kendall Tau Correlations

Since the Pearson correlation only measures linear correlation, we also computed the Spearman and Kendall Tau correlation scores. These may give better insight into which variables are more predictive of news source quality.

Both Spearman [15] and Kendall Tau [9] are rank-based correlation measures, thus they work well on monotonic relationships. Spearman does not handle tied ranks, which occur very often in our dataset due to NewsGuard's scoring method. Therefore, Kendall Tau seems to be the better measure and has been used to sort the rows in Table 2. We used the tau-b implementation available in scipy (https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.kendalltau.html).
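The correlation computation can be sketched as follows, assuming each feature is available as a numeric array aligned with the target scores. Only the log10(x + 1) transform and the choice of scipy's tau-b implementation come from the description above; the function, variable names and the toy data are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr


def correlations(feature: np.ndarray, target: np.ndarray) -> dict:
    """Pearson on the raw and log-transformed feature, plus Spearman and
    Kendall tau-b (scipy's default variant) against the target score."""
    log_feature = np.log10(feature + 1)  # "add one" avoids log10(0)
    return {
        "pearson_linear": pearsonr(feature, target),
        "pearson_log": pearsonr(log_feature, target),
        "spearman": spearmanr(feature, target),
        "kendall_tau_b": kendalltau(feature, target),
    }


# Toy example: a heavy-tailed count feature and a synthetic target score.
rng = np.random.default_rng(0)
backlinks = rng.pareto(a=2.0, size=200) * 1_000
target = 3 * np.log10(backlinks + 1) + rng.normal(0, 1, size=200)
print(correlations(backlinks, target))
```

Each call returns both the coefficient and the corresponding p-value; the p-values are what the significance threshold in the next section is applied to.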
4 Correlation Results

Table 2 shows the correlation scores (Pearson, Spearman, Kendall tau) between each feature and the total score from NewsGuard. Correlations with a p-value ≥ 0.00069 (using Bonferroni correction and counting both Pearson tests as one) are considered statistically non-significant.

Feature           | Pearson (linear) | Pearson (log) | spear. | kend.
GOV Backlinks     |  0.031 |  0.698 |  0.656 |  0.499
GOV Domains       |  0.201 |  0.698 |  0.627 |  0.473
EDU Backlinks     |  0.029 |  0.723 |  0.612 |  0.454
EDU Domains       |  0.305 |  0.723 |  0.556 |  0.408
Trust Metric*     |  0.614 |  0.662 |  0.542 |  0.399
Trust Flow*       |  0.614 |  0.662 |  0.542 |  0.399
Indexed URLs      |  0.019 |  0.584 |  0.537 |  0.396
Topic Value*      |  0.589 |  0.641 |  0.528 |  0.387
Ref. Domains      |  0.227 |  0.622 |  0.508 |  0.367
Google PageRank   |  0.581 |  0.575 |  0.448 |  0.354
Citation Flow*    |  0.523 |  0.538 |  0.449 |  0.327
Domain Authority  |  0.603 |  0.588 |  0.445 |  0.325
cPR Score         |  0.589 |  0.584 |  0.445 |  0.323
Ext. Backlinks    |  0.073 |  0.567 |  0.449 |  0.322
Page Authority*   |  0.521 |  0.524 |  0.397 |  0.284
Global Rank       | -0.338 | -0.427 | -0.323 | -0.232
Alexa Reach       | -0.327 | -0.414 | -0.313 | -0.224
Alexa USA*        | -0.379 | -0.360 | -0.276 | -0.197
Facebook Likes    | -0.076 | -0.149 | -0.229 | -0.163
Twitter Listed    |  0.131 |  0.388 |  0.231 |  0.162
Twitter Followers |  0.098 |  0.327 |  0.228 |  0.161
Facebook Follows  | -0.073 | -0.147 | -0.225 | -0.160
Spam Score        | -0.051 |  0.025 |  0.038 |  0.032

Table 2: Feature correlation with the NewsGuard score: Pearson (on raw and log-transformed features), Spearman and Kendall tau-b coefficients.

As expected, applying the logarithmic transformation yields large improvements in the Pearson correlation scores. Six features (marked with a star) did not meet our expectation as to whether the logarithm kernel would improve the linear correlation, although the differences in these cases are relatively small (< 0.05).

Many of the automatically retrievable features have a significant correlation with the NewsGuard scores. Notably, backlinks and referring domains, especially from government and educational websites, are very good indicators of trustworthy sources. Trust Metric and Trust Flow also work very well, confirming that seeded network graphs can be useful in practice.

One unexpected result is the negative correlation between Facebook likes/follows and the NewsGuard scores. This may be caused by the availability of paid "like farms" that sell fake likes on the platform, such as BoostLikes and SocialFormula. Even legitimate Facebook ad campaigns can result in significant amounts of such fake likes [5]. Confirming this, however, requires further analysis of the corresponding Facebook pages.

One should note that, since the dataset comes from NewsGuard, it is possible that unpopular news sources are under-represented.
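For completeness, the sketch below shows how a Bonferroni-corrected per-test threshold is obtained. The test count of 72 is illustrative (0.05 / 72 ≈ 0.00069, matching the cut-off used above) rather than a statement of the exact number of tests run.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Illustrative: 72 tests at alpha = 0.05 give roughly the 0.00069 cut-off
# applied to the correlations in Table 2.
threshold = bonferroni_threshold(alpha=0.05, n_tests=72)
print(round(threshold, 5))  # 0.00069

# A correlation is then treated as significant only if its p-value falls below it.
print(0.0001 < threshold)  # True
```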
5 Conclusion

In this paper, we release a dataset of 673 sources with manually assigned credibility and transparency scores. The scores come from NewsGuard's plugin. We manually accessed the plugin for 2714 news sources listed by Media Bias Fact Check and recorded, for the 673 covered sources, the detailed credibility and transparency scores NewsGuard provides. For the remaining 2041 sources NewsGuard did not have judgments.

We also extracted a rich set of features and performed a correlation analysis. Our results show that there are strong correlations between the NewsGuard scores and the features analysed in this work. This indicates that the credibility and transparency scoring could be automated.

In future work we aim to perform this step and create a regression model to automatically predict the credibility and transparency scores. This will allow us to obtain credibility scores for any source that has so far not been judged by NewsGuard. Note that since our features are language independent, this will allow us to obtain credibility scores for sources reporting in any language. We also plan to use the output of our regression models as an information nutrition label within NewsScan (www.news-scan.com) [10].

Acknowledgements

This work was partially supported by the European Union under grant agreement No. 825297 WeVerify (http://weverify.eu) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

[1] Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., and Nakov, P. Predicting factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765 (2018).

[2] Barabási, A.-L., and Pósfai, M. Network science. Cambridge University Press, Cambridge, 2016.

[3] Benesty, J., Chen, J., Huang, Y., and Cohen, I. Pearson correlation coefficient. In Noise reduction in speech processing. Springer, 2009, pp. 1–4.

[4] Burgoon, J. K., and Hale, J. L. The fundamental topoi of relational communication. Communication Monographs 51, 3 (1984), 193–214.

[5] De Cristofaro, E., Friedman, A., Jourjon, G., Kaafar, M. A., and Shafiq, M. Z. Paying for likes?: Understanding Facebook like fraud using honeypots. In Proceedings of the 2014 Conference on Internet Measurement Conference (New York, NY, USA, 2014), IMC '14, ACM, pp. 129–136.

[6] Demchenko, Y., Grosso, P., De Laat, C., and Membrey, P. Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on (2013), IEEE, pp. 48–55.

[7] Douglas, K., Ang, C. S., and Deravi, F. Farewell to truth? Conspiracy theories and fake news on social media. The Psychologist (2017).

[8] Hardalov, M., Koychev, I., and Nakov, P. In search of credible news. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications (2016), Springer, pp. 172–180.

[9] Kendall, M. G. The treatment of ties in ranking problems. Biometrika 33, 3 (1945), 239–251.

[10] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28–33.

[11] Klein, D. O., and Wueller, J. R. Fake news: A legal perspective. Journal of Internet Law 20, 10 (2017), 6–13.

[12] Markowitz, D. M., and Hancock, J. T. Linguistic traces of a scientific fraud: The case of Diederik Stapel. PLoS ONE 9, 8 (2014), e105937.

[13] Sud, P., and Thelwall, M. Linked title mentions: A new automated link search candidate. Scientometrics 101 (2014), 1831–1849.

[14] Rashkin, H., Choi, E., Jang, J. Y., Volkova, S., and Choi, Y. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), pp. 2931–2937.

[15] Spearman, C. The proof and measurement of association between two things. The American Journal of Psychology 15, 1 (1904), 72–101.

[16] Mavridis, T., and Symeonidis, A. L. Identifying valid search engine ranking factors in a Web 2.0 and Web 3.0 context for building efficient SEO mechanisms. Engineering Applications of Artificial Intelligence 41 (2015), 75–91.