Corpus of news articles annotated with article-level sentiment

Ahmet Aker, Hauke Gravenkamp, Sabrina J. Mayer
University of Duisburg-Essen, Germany
firstName.lastName@uni-due.de

Marius Hamacher, Anne Smets, Alicia Nti
University of Duisburg-Essen, Germany
firstName.lastName@stud.uni-due.de

Johannes Erdmann, Julia Serong, Anna Welpinghus
Technical University of Dortmund, Germany
firstName.lastName@tu.dortmund.de

Francesco Marchi
Ruhr University Bochum, Germany
firstName.lastName@rub.de

Abstract

Research on sentiment analysis has reached a mature state. Studies on this topic have proposed various solutions and datasets to guide machine-learning approaches. However, so far sentiment scoring has been restricted to the level of short textual units such as sentences. Our comparison shows that there is a huge gap between machines and human judges when the task is to determine sentiment scores of a longer text such as a news article. To close this gap, we propose a new human-annotated dataset containing 250 news articles with sentiment labels at the article level. Each article is annotated by at least 10 people. The articles are evenly divided into fake and non-fake categories. Our investigation on this corpus shows that fake articles are significantly more sentimental than non-fake ones. The dataset will be made publicly available.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 Introduction

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse, and reliable information, it is necessary to have access to additional information to evaluate the available news articles. For this purpose, Fuhr et al. [7] propose to label every online news article with information nutrition labels that describe the ingredients of the article and thus give the reader a chance to evaluate what she is reading. This concept is analogous to food packages, where nutrition labels help buyers in their decision-making. The authors discuss nine different information nutrition labels, including sentiment. The sentiment of a news article is subtly reflected by the tone and affective content of a writer's words [5]. Fuhr et al. [7] conclude that knowing about an article's level of sentiment could help the reader to judge its credibility and whether it is trying to deceive the reader by relying on emotional communication.

Sentiment analysis is a mature research direction and has been summarized by several overview papers and books [13, 3, 4]. Commonly, sentiment is computed on a small fraction of text such as a phrase or sentence. Using this strategy, the authors of [11, 14, 1], for instance, analyze Twitter posts. To compute sentiment over a text that spans several sentences, such as a news article, [9] use the aggregated average sentiment score of the text's sentences. However, our current study shows that this does not align with the human perception of sentiment.
If there are only, e.g., two sentences in the article that are sentimentally loaded and the remaining sentences are neutral, a sentence-based sentiment scorer will label the article as not sentimental or will assign a low sentiment score. On the contrary, our study shows that humans may consider the entire article as highly sentimental even if it contains only one or two highly sentimental sentences.

In this work, we propose to release a dataset containing 250 news articles with article-level sentiment labels (available at https://github.com/ahmetaker/newsArticlesWithSentimentScore). These labels were assigned to each article by at least 10 paid annotators. To our knowledge, this is the first article-level sentiment-labeled corpus. We believe this corpus will open new ways of addressing the sentiment perception gap between humans and machines. Over this corpus, we also run two automatic sentiment assessors and show that their scores do not correlate with human-assigned scores.

In addition, our articles are split into fake (125) and non-fake (125) articles. We show that, at the article level, fake articles are significantly more sentimental than the non-fake ones. This finding supports the assumption that sentiment will help readers to distinguish between credible and non-credible articles.

In the following, we first describe the dataset annotated with sentiment at the article level (Section 2). In Section 3, we present the inter-rater agreement among the annotators, the analysis of sentiment for fake and non-fake articles, as well as a qualitative analysis of articles with low and high sentiment scores. In Section 4, we provide results of our correlation analysis between human sentiment scores and those obtained automatically. Finally, we discuss our findings and conclude the paper in Section 5.

2 Dataset

We retrieved the news articles annotated in this work from FakeNewsNet [15], a corpus of news stories divided into fake and non-fake articles. To determine whether a story is fake or not, the FakeNewsNet authors extracted articles and veracity scores from two prevalent fact-checking sites, PolitiFact (https://www.politifact.com/) and GossipCop (https://www.gossipcop.com/).

We sampled 125 fake and 125 non-fake articles from this corpus. All articles deal with political news, mostly the 2016 US presidential election. Table 1 lists textual statistics about the articles.

Table 1: Textual statistics about articles in the dataset.

                                      fake      non-fake
text length               min        820       720
                          max        10062     12959
                          median     2576      3003
                          mean       2832.4    4124.4
sentences                 min        6         6
                          max        88        144
                          median     22        27
                          mean       24.4      36.1
average words             min        11.0      8.0
per sentence              max        35.7      36.7
                          median     19.8      19.5
                          mean       20.6      19.9

Each news article was rated between 10 and 22 times (mean = 15.524, median = 15), and each annotator rated 1 to 250 articles (mean = 42.185, median = 17). Annotators were recruited from colleagues and friends and were encouraged to refer the annotation project to their acquaintances. They were free to rate as many articles as they liked and were compensated with 3.50 € (or 3 £ if they were residents of the UK) per article. The recruitment method and the relatively high monetary compensation were chosen to ensure high data quality.

Sentiment was rated in two different ways. First, annotators were asked to rate textual qualities of the given article that indicate sentiment, for instance, "The article contains many words that transport particularly strong emotions." These qualities were measured by five properties on a 5-point rating scale labeled Strongly Disagree to Strongly Agree. Afterwards, annotators were asked to rate sentiment directly on a percentage scale ("Overall, how emotionally charged is the article? Judge on a scale from 0-100"), with 100 indicating high sentiment intensity and 0 indicating low sentiment intensity.

We opted for this two-fold annotation approach to generate sentiment scores that can be used to train machine-learning models as well as sentiment indicators that provide insights as to why and how people rate the level of sentiment of an article. In the present work, however, we only analyze the percentage scores for sentiment; when referring to annotations, we refer to these sentiment scores. The other sentiment variables are not discussed here due to space constraints.

Note that annotators did not annotate sentiment polarity, e.g. "highly positive" or "slightly negative", but only sentiment intensity, e.g. "high" or "low". In this scheme, highly positive and highly negative articles receive the same score. We chose this annotation scheme since a polarity judgment seems less informative at the article level: in cases where a single article praises one position and condemns another, an overall polarity score is ambiguous, and sentence-level polarity scores may be more informative.

The notion of sentiment intensity is also distinct from subjectivity. A subjective statement contains the personal views of an author, whereas an objective article contains facts about a certain topic. Both subjective and objective statements may or may not contain sentiment [12]. For example, "the man was killed" expresses a negative sentiment in an objective fashion, while "I believe the earth is flat" is a subjective statement expressing no sentiment. For an investigation of article-level subjectivity, see [2].
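As a usage illustration, the sketch below derives one article-level sentiment score per article by averaging the individual annotator scores, mirroring the aggregation described above. The file name and column names (article_id, label, annotator_id, sentiment) are assumptions made for this example only; the actual layout of the released repository may differ.

```python
import pandas as pd

# Hypothetical flat file: one row per (article, annotator) pair with the 0-100
# percentage score. Column names are illustrative, not the released schema.
annotations = pd.read_csv("newsArticlesWithSentimentScore.csv")

# Article-level sentiment = mean of all human percentage scores for that article.
article_scores = (
    annotations.groupby(["article_id", "label"], as_index=False)["sentiment"].mean()
)

# Compare the article-level score distributions of fake and non-fake articles.
print(article_scores.groupby("label")["sentiment"].describe())
```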
3 Analysis of Sentiment Scores

First, we measure differences in inter-rater agreement for fake and non-fake articles in order to see whether the annotators agree on the judgments or not. We also analyze the distribution of sentiment ratings to see whether there are differences in sentiment scores for fake and non-fake articles. Afterwards, we look at articles with particularly high or low sentiment scores to find differences in the writing of the articles that could influence annotators in their ratings and determine whether an article is perceived as sentimental.

3.1 Inter-rater Reliability Analysis

Inter-rater reliability is measured using the Intraclass Correlation (ICC) index. We assume a one-way random-effects model for absolute agreement with average measures as observation units (ICC(1,k)); we followed the guidelines of [8, 10] to select the ICC model parameters. Since not every annotator annotated every article, annotators are treated as a random effect in the model. We chose the minimum number of available annotations per article (k = 10) as the basis for the reliability analysis; in cases where more than 10 annotations were available for an article, we randomly chose 10 of them. The observational units are average measures, since the sentiment for each article is the average of all human annotations for the given article.

The total intraclass correlation is 0.88, which indicates good to excellent reliability [10]. Reliability is slightly higher for real articles (ICC(1,10) = .90) than for fake articles (ICC(1,10) = .76) (see Table 2).

Table 2: Intraclass correlation values.

                                               95% CI
            N     Raters   Unit       ICC      lower    upper
total       250   10       average    .88      .86      .90
                           single     .42      .37      .48
fake        125   10       average    .76      .67      .81
non-fake    125   10       average    .90      .87      .92

Note that there is a large discrepancy between the average point estimates and the single point estimates for the same data (ICC(1,1) = .42, 95% CI [.37, .48]). While this is generally expected [8], we considered the difference to be large enough to report.
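For readers who want to recompute reliability on the released data, the following is a minimal sketch of the one-way random-effects estimators used above, ICC(1,1) for single ratings and ICC(1,k) for the average of k ratings, written directly from the mean-square definitions in [8, 10]. It is not the authors' analysis script, and the simulated input only mimics the 250 x 10 sampled design.

```python
import numpy as np

def icc_one_way(ratings: np.ndarray):
    """One-way random-effects ICC for an (n_articles x k_raters) matrix.

    Returns ICC(1,1) (single rater) and ICC(1,k) (average of k raters),
    following the mean-square formulation in Koo and Li [10].
    """
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Between-article and within-article mean squares.
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average

# Simulated data shaped like the study: 250 articles, 10 sampled ratings each,
# with a per-article "true" sentiment plus rater noise (for illustration only).
rng = np.random.default_rng(0)
article_effect = rng.uniform(0, 100, size=(250, 1))
ratings = np.clip(article_effect + rng.normal(0, 15, size=(250, 10)), 0, 100)
print(icc_one_way(ratings))
```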
3.2 Annotation Distribution

The dataset contains 3788 sentiment score annotations, ranging between 0 and 100. The mean score is 49.92 with a standard deviation of 32.54. Looking at all articles, the scores are mostly uniformly distributed, with minor peaks at the maximum and minimum values (see Figure 1).

Figure 1: Histograms of sentiment scores. Values are sorted into 10 categories.

The distribution changes when dividing the articles into fake and real ones. Fake articles receive higher scores (mean = 61.50) than non-fake ones (mean = 38.69). Using a t-test, we found a significant difference (t(3786) = 22.99, p < .001) of medium magnitude (Cohen's d = .75). In addition, 70.4 percent of the fake articles received a sentiment score of 50 or higher, compared to only 40.6 percent of the real articles. This shows that fake articles are indeed rated as significantly more sentimental than the non-fake ones.
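The comparison above (independent-samples t-test plus Cohen's d) can be reproduced with standard tooling; the sketch below is a generic illustration over two arrays of annotation scores, with placeholder values rather than the actual annotations.

```python
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Placeholder 0-100 annotation scores for fake and non-fake articles.
fake_scores = np.array([70.0, 85.0, 55.0, 90.0, 60.0])
nonfake_scores = np.array([30.0, 45.0, 20.0, 50.0, 35.0])

t_stat, p_value = stats.ttest_ind(fake_scores, nonfake_scores)
print(t_stat, p_value, cohens_d(fake_scores, nonfake_scores))
```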
3.3 Qualitative Analysis

A first qualitative analysis of the articles with the highest and lowest mean sentiment scores indicates differences in language use and sentence structure.

Articles with a low sentiment score are mostly election reports and contain listings of facts and figures. To give examples: "Solid Republican: Alabama (9), Alaska (3), Arkansas (6), Idaho (4), Indiana (11), Kansas (6), Kentucky (8), Louisiana (8) [...]", or "Clinton's strength comes from the Atlanta area, where she leads Trump 55% to 35%. But Trump leads her 51% to 33% elsewhere in the Peach state. She leads 88% to 4% among (..)." The last example also demonstrates a repetitive and simple sentence structure, for instance the repeated use of the word "leads". "In Iowa Sept. 29. In Kansas Oct. 19. [...]" is another example of repetitive language use. On the whole, the language used seems unemotional, rather neutral, and without bias.

Articles with the highest mean scores contain a larger number of negative words. "Kill", "murder", "guns", "shooting", "racism" and "dead and bloodied" are a few specific examples of negative words we observed in these articles. To some extent, offensive language is used, which indicates a subjective view and bias. Statements such as "[...] sick human being unfit for any political office [...]" or "[...] nothing but a bunch of idiot lowlifes" can be quoted as examples of offensive language use. In some high-sentiment articles, we also found rhetorical devices such as analogies, comparisons, and rhetorical questions, which do not occur in the same manner in the low-sentiment articles. Analogies and comparisons are introduced by the word "like", as in the following sentence: "Clinton speculated about this, and like a predictable rube under the hot lights Trump cracked under the pressure." The following sentence gives an example of a rhetorical question found in one of the articles: "Did Trump say he was interested in paying higher taxes? No. Did Trump say he would like to reform the tax code so that he would be forced to pay higher taxes? No."

4 Comparison between Model Predictions and Human Annotations

To see how existing sentence-level sentiment analysis models perform on the dataset, we used the Pattern3 Web Mining Package [6] (https://github.com/pattern3) and the Stanford CoreNLP package (https://stanfordnlp.github.io/CoreNLP/).

The Pattern3 package provides a dictionary-based sentiment analyzer built on a dictionary of adjectives and their corresponding sentiment polarity and intensity. The model determines the sentiment score of a sentence by averaging the sentiment scores of all adjectives in the sentence. Scores range between -1.0 (negative) and 1.0 (positive).

The Stanford CoreNLP package provides a recursive neural network model for sentiment analysis. It assigns sentiment labels based on the content and syntactic structure of a sentence. The output is one of five labels (very negative, negative, neutral, positive, very positive).

Model predictions were obtained by processing the articles in the dataset sentence by sentence and averaging over the sentence scores. Since the models assign sentiment values on different scales than the one used by our annotators, we mapped the values to match our scale. For the Pattern3 scores, we took the absolute value and multiplied it by 100; for the Stanford scores, we mapped the labels to intensity scores (very negative = 100, negative = 50, neutral = 0, positive = 50, very positive = 100). Human ratings represent the average sentiment score per article.
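The rescaling and aggregation just described can be summarized compactly. The sketch below assumes that sentence-level outputs have already been obtained from the two tools (a Pattern3 polarity in [-1.0, 1.0] per sentence, and one of the five Stanford CoreNLP labels per sentence) and shows only the mapping and averaging step, not the tool invocations themselves.

```python
from statistics import mean

# Mapping of the five Stanford CoreNLP sentiment labels to intensity scores,
# as described above (polarity is discarded, only intensity is kept). The exact
# capitalization of the tool's output strings may differ in practice.
STANFORD_INTENSITY = {
    "very negative": 100, "negative": 50, "neutral": 0,
    "positive": 50, "very positive": 100,
}

def pattern3_article_score(sentence_polarities):
    """Article score from Pattern3 sentence polarities in [-1.0, 1.0]."""
    return mean(abs(p) * 100 for p in sentence_polarities)

def stanford_article_score(sentence_labels):
    """Article score from Stanford CoreNLP sentence labels."""
    return mean(STANFORD_INTENSITY[label] for label in sentence_labels)

# Toy example: a mostly neutral article with one strongly charged sentence.
print(pattern3_article_score([0.0, 0.05, -0.9, 0.1]))
print(stanford_article_score(["neutral", "neutral", "very negative", "neutral"]))
```

The toy input also illustrates the dilution effect discussed in Section 5: a single strongly charged sentence barely moves the averaged article score.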
4.1 Results

In general, model predictions are lower than the human ratings and span a narrower range of values. Predictions of the Pattern3 sentiment analyzer range from 2.04 to 32.99 with a mean of 14.81 and a standard deviation of 5.47. Predictions of the Stanford CoreNLP analyzer range from 24.07 to 62.5 with a mean of 43.18 and a standard deviation of 5.69. In contrast, human annotations span a wide range of values, from 4.55 to 95.25, with a mean of 49.39 and a standard deviation of 21.77.

Figure 2 shows a scatter plot of the human ratings and the model predictions. The correlations are significant yet very small (r = .171, R^2 = .029, p < .001 for Pattern3; r = -.139, R^2 = .019, p = .028 for Stanford CoreNLP), and prediction errors are high, with those of Pattern3 being larger (MSE = 1657.87, MAE = 35.05) than those of Stanford CoreNLP (MSE = 576, MAE = 20.42).

Figure 2: Scatter plot showing human ratings and model predictions for each sentiment analyzer.

We also looked at the distribution of sentiment scores for the model predictions. When comparing scores assigned to fake articles (mean = 15.19) and scores assigned to real articles (mean = 14.43), the predictions do not differ significantly (t(248) = 1.09, p = .28, Cohen's d = 0.14). On the other hand, analyzing the scores assigned by human annotators at the article level, we found a significant difference between fake articles (mean = 60.36) and real articles (mean = 38.42) with a large magnitude (t(248) = 9.21, p < .001, Cohen's d = 1.17). The results indicate that the computation of an overall sentiment score based on sentence-level sentiment scores is not useful for fake news detection. However, human ratings at the article level can indeed be used to distinguish between fake and non-fake articles.
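The agreement statistics reported above (Pearson r, R^2, MSE, MAE) follow from two aligned arrays of article-level scores; the sketch below is a generic re-implementation with placeholder values, not the authors' evaluation code.

```python
import numpy as np
from scipy import stats

def agreement(human: np.ndarray, predicted: np.ndarray) -> dict:
    """Pearson r, R^2, MSE and MAE between human ratings and model predictions."""
    r, p_value = stats.pearsonr(human, predicted)
    mse = float(np.mean((human - predicted) ** 2))
    mae = float(np.mean(np.abs(human - predicted)))
    return {"r": r, "r_squared": r ** 2, "p": p_value, "mse": mse, "mae": mae}

# One averaged score per article (placeholder values for illustration).
human_scores = np.array([80.0, 20.0, 55.0, 95.0, 10.0])
model_scores = np.array([15.0, 10.0, 20.0, 30.0, 5.0])
print(agreement(human_scores, model_scores))
```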
5 Discussion and Conclusion

A new human-annotated sentiment dataset is presented in this paper. To the best of our knowledge, it is the first dataset providing high-quality, article-level sentiment ratings.

Our analysis of model predictions shows that sentence-level sentiment estimates are unable to match human estimates for entire articles. Sentence-level models underestimate the true sentiment scores, probably because their results are averaged over the sentiments of all sentences. The fact that the Pattern3 predictions are generally lower than those of Stanford CoreNLP supports this hypothesis, as Pattern3 averages over all adjectives in a sentence and over all sentences, whereas the Stanford model is only averaged over all sentences in the article. If an article contains mostly neutral sentences and only a few sentences with strong emotional statements, these models will assign the article a relatively low score. In contrast, for human readers, a few such emotionally charged sentences can already shape the perception of the entire article. Sentiment analysis models should therefore operate at the article level rather than at the sentence level. Our dataset can be used to train such models and is thus a valuable addition to the collection of available sentiment datasets.

Furthermore, fake and real articles differ in the distribution of sentiment annotations. Real articles in our dataset receive significantly lower sentiment scores than fake ones. This qualifies sentiment as a potential feature for fake news classification of political news articles. Sentence-level models failed to generate scores that reflect this relation; models could be improved by making predictions at the article level and by using our dataset for training.

Future research could examine this finding further by incorporating more articles, potentially also from different topic domains, as our dataset includes only political news articles.

We also started investigating where differences in sentiment may come from and (unsurprisingly) find that more extreme and emotionally charged statements were used in high-sentiment articles. As mentioned earlier, the interesting finding here is that even a few such statements seem to affect the overall impression of an article's sentiment. In future studies, this investigation could be expanded either by detecting which sentences have the largest impact on the overall sentiment score of an article or by identifying individual-level determinants that affect people's perception of sentiment in an article.

6 Acknowledgements

This work was funded by the Global Young Faculty (https://www.global-young-faculty.de/) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

[1] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011) (2011), pp. 30-38.

[2] Aker, A., Gravenkamp, H., Mayer, S. J., Hamacher, M., Smets, A., Nti, A., Erdmann, J., Serong, J., Welpinghus, A., and Marchi, F. Corpus of news articles annotated with article level subjectivity. In ROME 2019: Workshop on Reducing Online Misinformation Exposure (2019).

[3] Cambria, E., Das, D., Bandyopadhyay, S., and Feraco, A. A Practical Guide to Sentiment Analysis. Springer, 2017.

[4] Cambria, E., Schuller, B., Xia, Y., and Havasi, C. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems 28, 2 (2013), 15-21.

[5] Conroy, N. J., Rubin, V. L., and Chen, Y. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology 52, 1 (2015), 1-4.

[6] De Smedt, T., and Daelemans, W. Pattern for Python. Journal of Machine Learning Research 13, 1 (June 2012), 2063-2067.

[7] Fuhr, N., Nejdl, W., Peters, I., Stein, B., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., and Mothe, J. An information nutritional label for online documents. ACM SIGIR Forum 51, 3 (Feb. 2018), 46-66.

[8] Hallgren, K. A. Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology 8, 1 (2012), 23-34.

[9] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28-33.

[10] Koo, T. K., and Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine 15, 2 (June 2016), 155-163.

[11] Kouloumpis, E., Wilson, T., and Moore, J. Twitter sentiment analysis: The good the bad and the OMG! In Fifth International AAAI Conference on Weblogs and Social Media (2011).

[12] Liu, B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5, 1 (2012), 1-167.

[13] Mejova, Y. Sentiment analysis: An overview. University of Iowa, Computer Science Department (2009).

[14] Pak, A., and Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In LREC (2010), vol. 10, pp. 1320-1326.

[15] Shu, K., Mahudeswaran, D., Wang, S., Lee, D., and Liu, H. FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media. CoRR abs/1809.01286 (2018).