Corpus of news articles annotated with article-level sentiment

Ahmet Aker, Hauke Gravenkamp, Sabrina J. Mayer
University of Duisburg-Essen, Germany
firstName.lastName@uni-due.de

Marius Hamacher, Anne Smets, Alicia Nti
University of Duisburg-Essen, Germany
firstName.lastName@stud.uni-due.de

Johannes Erdmann, Julia Serong, Anna Welpinghus
Technical University of Dortmund, Germany
firstName.lastName@tu.dortmund.de

Francesco Marchi
Ruhr University Bochum, Germany
firstName.lastName@rub.de

Abstract

Research on sentiment analysis has reached a mature state. Studies on this topic have proposed various solutions and datasets to guide machine-learning approaches. However, so far sentiment scoring has been restricted to the level of short textual units such as sentences. Our comparison shows that there is a huge gap between machines and human judges when the task is to determine sentiment scores of a longer text such as a news article. To close this gap, we propose a new human-annotated dataset containing 250 news articles with sentiment labels at the article level. Each article is annotated by at least 10 people. The articles are evenly divided into fake and non-fake categories. Our investigation on this corpus shows that fake articles are significantly more sentimental than non-fake ones. The dataset will be made publicly available.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 Introduction

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse, and reliable information, it is necessary to have access to additional information to evaluate the available news articles. For this purpose, Fuhr et al. [7] propose to label every online news article with information nutrition labels that describe the ingredients of the article and thus give the reader a chance to evaluate what she is reading. This concept is analogous to food packages, where nutrition labels help buyers in their decision-making. The authors discuss nine different information nutrition labels, including sentiment. The sentiment of a news article is subtly reflected by the tone and affective content of a writer's words [5]. Fuhr et al. [7] conclude that knowing about an article's level of sentiment could help the reader to judge its credibility and whether it is trying to deceive the reader by relying on emotional communication.

Sentiment analysis is a mature research direction and has been summarized by several overview papers and books [13, 3, 4]. Commonly, sentiment is computed on a small fraction of text such as a phrase or sentence. Using this strategy, the authors of [11, 14, 1], for instance, analyze Twitter posts. To compute sentiment over a text that spans several sentences, such as a news article, [9] use the aggregated average sentiment score of the text's sentences. However, our current study shows that this does not align with the human perception of sentiment.
If there are only, e.g., two sentences in the article that are sentimentally loaded and the remaining sentences are neutral, a sentence-based sentiment scorer will label the article as not sentimental or will assign a low sentiment score. On the contrary, our study shows that humans may consider the entire article as highly sentimental even if it contains only one or two highly sentimental sentences.

In this work, we propose to release a dataset containing 250 news articles with article-level sentiment labels (available at https://github.com/ahmetaker/newsArticlesWithSentimentScore). These labels were assigned to each article by at least 10 paid annotators. To our knowledge, this is the first article-level sentiment-labeled corpus. We believe this corpus will open new ways of addressing the sentiment perception gap between humans and machines. Over this corpus, we also run two automatic sentiment assessors and show that their scores do not correlate with human-assigned scores.

In addition, our articles are split into fake (125) and non-fake (125) articles. We show that, at the article level, fake articles are significantly more sentimental than the non-fake ones. This finding supports the assumption that sentiment will help readers to distinguish between credible and non-credible articles.

In the following, we first describe the dataset annotated with sentiment at the article level (Section 2). In Section 3, we present the inter-rater agreement among the annotators, the analysis of sentiment for fake and non-fake articles, as well as a qualitative analysis of articles with low and high sentiment scores. In Section 4, we provide results of our correlation analysis between human sentiment scores and those obtained automatically. Finally, we discuss our findings and conclude the paper in Section 5.

2 Dataset

We retrieved the news articles annotated in this work from FakeNewsNet [15], a corpus of news stories divided into fake and non-fake articles. To determine whether a story is fake or not, the FakeNewsNet authors extracted articles and veracity scores from two prevalent fact-checking sites, PolitiFact (https://www.politifact.com/) and GossipCop (https://www.gossipcop.com/).

We sampled 125 fake and 125 non-fake articles from this corpus. All articles deal with political news, mostly the 2016 US presidential election. Table 1 lists textual statistics about the articles.

Table 1: Textual statistics about articles in the dataset.

                                      fake      non-fake
text length               min        820       720
                          max        10062     12959
                          median     2576      3003
                          mean       2832.4    4124.4
sentences                 min        6         6
                          max        88        144
                          median     22        27
                          mean       24.4      36.1
average words             min        11.0      8.0
per sentence              max        35.7      36.7
                          median     19.8      19.5
                          mean       20.6      19.9

Each news article was rated between 10 and 22 times (mean = 15.524, median = 15), and each annotator rated 1 to 250 articles (mean = 42.185, median = 17). Annotators were recruited from colleagues and friends and were encouraged to refer the annotation project to their acquaintances. They were free to rate as many articles as they liked and were compensated with 3.50 € (or 3 £ if they were residents of the UK) per article. The recruitment method and the relatively high monetary compensation were chosen to ensure high data quality.

Sentiment was rated in two different ways. First, annotators were asked to rate textual qualities of the given article that indicate sentiment, for instance, "The article contains many words that transport particularly strong emotions." These qualities were measured by five properties on a 5-point rating scale labeled Strongly Disagree to Strongly Agree. Afterwards, annotators were asked to rate sentiment directly on a percentage scale ("Overall, how emotionally charged is the article? Judge on a scale from 0-100"), with 100 indicating high sentiment intensity and 0 indicating low sentiment intensity.

We opted for this two-fold annotation approach to generate sentiment scores that can be used to train machine-learning models as well as sentiment indicators that provide insights as to why and how people rate the level of sentiment of an article. In the present work, however, we only analyze the percentage scores for sentiment; when referring to annotations, we refer to these sentiment scores. The other sentiment variables are not discussed here due to space constraints.

Note that annotators did not annotate sentiment polarity, e.g. "highly positive" or "slightly negative", but only sentiment intensity, e.g. "high" or "low". In this scheme, highly positive and highly negative articles receive the same score. We chose this annotation scheme since a polarity judgment seems less informative at the article level: in cases where a single article praises one position and condemns another, an overall polarity score is ambiguous, and sentence-level polarity scores may be more informative.

The notion of sentiment intensity is also distinct from subjectivity. A subjective statement contains the personal views of an author, whereas an objective article contains facts about a certain topic. Both subjective and objective statements may or may not contain sentiment [12]. For example, "the man was killed" expresses a negative sentiment in an objective fashion, while "I believe the earth is flat" is a subjective statement expressing no sentiment. For an investigation of article-level subjectivity, see [2].
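As a usage illustration, the sketch below derives one article-level sentiment score per article by averaging the individual annotator scores, mirroring the aggregation described above. The file name and column names (article_id, label, annotator_id, sentiment) are assumptions made for this example only; the actual layout of the released repository may differ.

```python
import pandas as pd

# Hypothetical flat file: one row per (article, annotator) pair with the 0-100
# percentage score. Column names are illustrative, not the released schema.
annotations = pd.read_csv("newsArticlesWithSentimentScore.csv")

# Article-level sentiment = mean of all human percentage scores for that article.
article_scores = (
    annotations.groupby(["article_id", "label"], as_index=False)["sentiment"].mean()
)

# Compare the article-level score distributions of fake and non-fake articles.
print(article_scores.groupby("label")["sentiment"].describe())
```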
3 Analysis of Sentiment Scores

First, we measure differences in inter-rater agreement for fake and non-fake articles in order to see whether the annotators agree on the judgments or not. We also analyze the distribution of sentiment ratings to see whether there are differences in sentiment scores for fake and non-fake articles. Afterwards, we look at articles with particularly high or low sentiment scores to find differences in the writing of the articles that could influence annotators in their ratings and determine whether an article is perceived as sentimental.

3.1 Inter-rater Reliability Analysis

Inter-rater reliability is measured using the Intraclass Correlation (ICC) index. We assume a one-way random-effects model for absolute agreement with average measures as observation units (ICC(1,k)); we followed the guidelines of [8, 10] to select the ICC model parameters. Since not every annotator annotated every article, annotators are treated as a random effect in the model. We chose the minimum number of available annotations per article (k = 10) as the basis for the reliability analysis; in cases where more than 10 annotations were available for an article, we randomly chose 10 of them. The observational units are average measures, since the sentiment for each article is the average of all human annotations for the given article.

The total intraclass correlation is 0.88, which indicates good to excellent reliability [10]. Reliability is slightly higher for real articles (ICC(1,10) = .90) than for fake articles (ICC(1,10) = .76) (see Table 2).

Table 2: Intraclass correlation values.

                                               95% CI
            N     Raters   Unit       ICC      lower    upper
total       250   10       average    .88      .86      .90
                           single     .42      .37      .48
fake        125   10       average    .76      .67      .81
non-fake    125   10       average    .90      .87      .92

Note that there is a large discrepancy between the average point estimates and the single point estimates for the same data (ICC(1,1) = .42, 95% CI [.37, .48]). While this is generally expected [8], we considered the difference to be large enough to report.
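For readers who want to recompute reliability on the released data, the following is a minimal sketch of the one-way random-effects estimators used above, ICC(1,1) for single ratings and ICC(1,k) for the average of k ratings, written directly from the mean-square definitions in [8, 10]. It is not the authors' analysis script, and the simulated input only mimics the 250 x 10 sampled design.

```python
import numpy as np

def icc_one_way(ratings: np.ndarray):
    """One-way random-effects ICC for an (n_articles x k_raters) matrix.

    Returns ICC(1,1) (single rater) and ICC(1,k) (average of k raters),
    following the mean-square formulation in Koo and Li [10].
    """
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Between-article and within-article mean squares.
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average

# Simulated data shaped like the study: 250 articles, 10 sampled ratings each,
# with a per-article "true" sentiment plus rater noise (for illustration only).
rng = np.random.default_rng(0)
article_effect = rng.uniform(0, 100, size=(250, 1))
ratings = np.clip(article_effect + rng.normal(0, 15, size=(250, 10)), 0, 100)
print(icc_one_way(ratings))
```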
3.2 Annotation Distribution

The dataset contains 3788 sentiment score annotations, ranging between 0 and 100. The mean score is 49.92 with a standard deviation of 32.54. Looking at all articles, the scores are mostly uniformly distributed, with minor peaks at the maximum and minimum values (see Figure 1).

Figure 1: Histograms of sentiment scores. Values are sorted into 10 categories.

The distribution changes when dividing the articles into fake and real ones. Fake articles receive higher scores (mean = 61.50) than non-fake ones (mean = 38.69). Using a t-test, we found a significant difference (t(3786) = 22.99, p < .001) of medium magnitude (Cohen's d = .75). In addition, 70.4 percent of the fake articles received a sentiment score of 50 or higher, compared to only 40.6 percent of the real articles. This shows that fake articles are indeed rated as significantly more sentimental than the non-fake ones.
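The comparison above (independent-samples t-test plus Cohen's d) can be reproduced with standard tooling; the sketch below is a generic illustration over two arrays of annotation scores, with placeholder values rather than the actual annotations.

```python
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Placeholder 0-100 annotation scores for fake and non-fake articles.
fake_scores = np.array([70.0, 85.0, 55.0, 90.0, 60.0])
nonfake_scores = np.array([30.0, 45.0, 20.0, 50.0, 35.0])

t_stat, p_value = stats.ttest_ind(fake_scores, nonfake_scores)
print(t_stat, p_value, cohens_d(fake_scores, nonfake_scores))
```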
3.3 Qualitative Analysis

A first qualitative analysis of the articles with the highest and lowest mean sentiment scores indicates differences in language use and sentence structure.

Articles with a low sentiment score are mostly election reports and contain listings of facts and figures. To give examples: "Solid Republican: Alabama (9), Alaska (3), Arkansas (6), Idaho (4), Indiana (11), Kansas (6), Kentucky (8), Louisiana (8) [...]", or "Clinton's strength comes from the Atlanta area, where she leads Trump 55% to 35%. But Trump leads her 51% to 33% elsewhere in the Peach state. She leads 88% to 4% among (..)." The last example also demonstrates a repetitive and simple sentence structure, for instance the repeated use of the word "leads". "In Iowa Sept. 29. In Kansas Oct. 19. [...]" is another example of repetitive language use. On the whole, the language used seems unemotional, rather neutral, and without bias.

Articles with the highest mean scores contain a larger number of negative words. "Kill", "murder", "guns", "shooting", "racism" and "dead and bloodied" are a few specific examples of negative words we observed in these articles. To some extent, offensive language is used, which indicates a subjective view and bias. Statements such as "[...] sick human being unfit for any political office [...]" or "[...] nothing but a bunch of idiot lowlifes" can be quoted as examples of offensive language use. In some high-sentiment articles, we also found rhetorical devices such as analogies, comparisons, and rhetorical questions, which do not occur in the same manner in the low-sentiment articles. Analogies and comparisons are introduced by the word "like", as in the following sentence: "Clinton speculated about this, and like a predictable rube under the hot lights Trump cracked under the pressure." The following sentence gives an example of a rhetorical question found in one of the articles: "Did Trump say he was interested in paying higher taxes? No. Did Trump say he would like to reform the tax code so that he would be forced to pay higher taxes? No."

4 Comparison between Model Predictions and Human Annotations

To see how existing sentence-level sentiment analysis models perform on the dataset, we used the Pattern3 Web Mining Package [6] (https://github.com/pattern3) and the Stanford CoreNLP package (https://stanfordnlp.github.io/CoreNLP/).

The Pattern3 package provides a dictionary-based sentiment analyzer built on a dictionary of adjectives and their corresponding sentiment polarity and intensity. The model determines the sentiment score of a sentence by averaging the sentiment scores of all adjectives in the sentence. Scores range between -1.0 (negative) and 1.0 (positive).

The Stanford CoreNLP package provides a recursive neural network model for sentiment analysis. It assigns sentiment labels based on the content and syntactic structure of a sentence. The output is one of five labels (very negative, negative, neutral, positive, very positive).

Model predictions were obtained by processing the articles in the dataset sentence by sentence and averaging over the sentence scores. Since the models assign sentiment values on different scales than the one used by our annotators, we mapped the values to match our scale. For the Pattern3 scores, we took the absolute value and multiplied it by 100; for the Stanford scores, we mapped the labels to intensity scores (very negative = 100, negative = 50, neutral = 0, positive = 50, very positive = 100). Human ratings represent the average sentiment score per article.
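The rescaling and aggregation just described can be summarized compactly. The sketch below assumes that sentence-level outputs have already been obtained from the two tools (a Pattern3 polarity in [-1.0, 1.0] per sentence, and one of the five Stanford CoreNLP labels per sentence) and shows only the mapping and averaging step, not the tool invocations themselves.

```python
from statistics import mean

# Mapping of the five Stanford CoreNLP sentiment labels to intensity scores,
# as described above (polarity is discarded, only intensity is kept). The exact
# capitalization of the tool's output strings may differ in practice.
STANFORD_INTENSITY = {
    "very negative": 100, "negative": 50, "neutral": 0,
    "positive": 50, "very positive": 100,
}

def pattern3_article_score(sentence_polarities):
    """Article score from Pattern3 sentence polarities in [-1.0, 1.0]."""
    return mean(abs(p) * 100 for p in sentence_polarities)

def stanford_article_score(sentence_labels):
    """Article score from Stanford CoreNLP sentence labels."""
    return mean(STANFORD_INTENSITY[label] for label in sentence_labels)

# Toy example: a mostly neutral article with one strongly charged sentence.
print(pattern3_article_score([0.0, 0.05, -0.9, 0.1]))
print(stanford_article_score(["neutral", "neutral", "very negative", "neutral"]))
```

The toy input also illustrates the dilution effect discussed in Section 5: a single strongly charged sentence barely moves the averaged article score.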
4.1 Results

In general, model predictions are lower than the human ratings and span a narrower range of values. Predictions of the Pattern3 sentiment analyzer range from 2.04 to 32.99 with a mean of 14.81 and a standard deviation of 5.47. Predictions of the Stanford CoreNLP analyzer range from 24.07 to 62.5 with a mean of 43.18 and a standard deviation of 5.69. In contrast, human annotations span a wide range of values, from 4.55 to 95.25, with a mean of 49.39 and a standard deviation of 21.77.

Figure 2 shows a scatter plot of the human ratings and the model predictions. The correlations are significant yet very small (r = .171, R^2 = .029, p < .001 for Pattern3; r = -.139, R^2 = .019, p = .028 for Stanford CoreNLP), and prediction errors are high, with those of Pattern3 being larger (MSE = 1657.87, MAE = 35.05) than those of Stanford CoreNLP (MSE = 576, MAE = 20.42).

Figure 2: Scatter plot showing human ratings and model predictions for each sentiment analyzer.

We also looked at the distribution of sentiment scores for the model predictions. When comparing scores assigned to fake articles (mean = 15.19) and scores assigned to real articles (mean = 14.43), the predictions do not differ significantly (t(248) = 1.09, p = .28, Cohen's d = 0.14). On the other hand, analyzing the scores assigned by human annotators at the article level, we found a significant difference between fake articles (mean = 60.36) and real articles (mean = 38.42) with a large magnitude (t(248) = 9.21, p < .001, Cohen's d = 1.17). The results indicate that the computation of an overall sentiment score based on sentence-level sentiment scores is not useful for fake news detection. However, human ratings at the article level can indeed be used to distinguish between fake and non-fake articles.
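The agreement statistics reported above (Pearson r, R^2, MSE, MAE) follow from two aligned arrays of article-level scores; the sketch below is a generic re-implementation with placeholder values, not the authors' evaluation code.

```python
import numpy as np
from scipy import stats

def agreement(human: np.ndarray, predicted: np.ndarray) -> dict:
    """Pearson r, R^2, MSE and MAE between human ratings and model predictions."""
    r, p_value = stats.pearsonr(human, predicted)
    mse = float(np.mean((human - predicted) ** 2))
    mae = float(np.mean(np.abs(human - predicted)))
    return {"r": r, "r_squared": r ** 2, "p": p_value, "mse": mse, "mae": mae}

# One averaged score per article (placeholder values for illustration).
human_scores = np.array([80.0, 20.0, 55.0, 95.0, 10.0])
model_scores = np.array([15.0, 10.0, 20.0, 30.0, 5.0])
print(agreement(human_scores, model_scores))
```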
5 Discussion and Conclusion

A new human-annotated sentiment dataset is presented in this paper. To the best of our knowledge, it is the first dataset providing high-quality, article-level sentiment ratings.

Our analysis of model predictions shows that sentence-level sentiment estimates are unable to match human estimates for entire articles. Sentence-level models underestimate the true sentiment scores, probably because their results are averaged over the sentiments of all sentences. The fact that the Pattern3 predictions are generally lower than those of Stanford CoreNLP supports this hypothesis, as Pattern3 averages over all adjectives in a sentence and over all sentences, whereas the Stanford model is only averaged over all sentences in the article. If an article contains mostly neutral sentences and only a few sentences with strong emotional statements, these models will assign the article a relatively low score. In contrast, for human readers, a few such emotionally charged sentences can already shape the perception of the entire article. Sentiment analysis models should therefore operate at the article level rather than at the sentence level. Our dataset can be used to train such models and is thus a valuable addition to the collection of available sentiment datasets.

Furthermore, fake and real articles differ in the distribution of sentiment annotations. Real articles in our dataset receive significantly lower sentiment scores than fake ones. This qualifies sentiment as a potential feature for fake news classification of political news articles. Sentence-level models failed to generate scores that reflect this relation; models could be improved by making predictions at the article level and by using our dataset for training.

Future research could examine this finding further by incorporating more articles, potentially also from different topic domains, as our dataset includes only political news articles.

We also started investigating where differences in sentiment may come from and (unsurprisingly) find that more extreme and emotionally charged statements were used in high-sentiment articles. As mentioned earlier, the interesting finding here is that even a few such statements seem to affect the overall impression of an article's sentiment. In future studies, this investigation could be expanded either by detecting which sentences have the largest impact on the overall sentiment score of an article or by identifying individual-level determinants that affect people's perception of sentiment in an article.

6 Acknowledgements

This work was funded by the Global Young Faculty (https://www.global-young-faculty.de/) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

[1] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011) (2011), pp. 30-38.

[2] Aker, A., Gravenkamp, H., Mayer, S. J., Hamacher, M., Smets, A., Nti, A., Erdmann, J., Serong, J., Welpinghus, A., and Marchi, F. Corpus of news articles annotated with article level subjectivity. In ROME 2019: Workshop on Reducing Online Misinformation Exposure (2019).

[3] Cambria, E., Das, D., Bandyopadhyay, S., and Feraco, A. A Practical Guide to Sentiment Analysis. Springer, 2017.

[4] Cambria, E., Schuller, B., Xia, Y., and Havasi, C. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems 28, 2 (2013), 15-21.

[5] Conroy, N. J., Rubin, V. L., and Chen, Y. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology 52, 1 (2015), 1-4.

[6] De Smedt, T., and Daelemans, W. Pattern for Python. Journal of Machine Learning Research 13, 1 (June 2012), 2063-2067.

[7] Fuhr, N., Nejdl, W., Peters, I., Stein, B., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., and Mothe, J. An information nutritional label for online documents. ACM SIGIR Forum 51, 3 (Feb. 2018), 46-66.

[8] Hallgren, K. A. Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology 8, 1 (2012), 23-34.

[9] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28-33.

[10] Koo, T. K., and Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine 15, 2 (June 2016), 155-163.

[11] Kouloumpis, E., Wilson, T., and Moore, J. Twitter sentiment analysis: The good the bad and the OMG! In Fifth International AAAI Conference on Weblogs and Social Media (2011).

[12] Liu, B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5, 1 (2012), 1-167.

[13] Mejova, Y. Sentiment analysis: An overview. University of Iowa, Computer Science Department (2009).

[14] Pak, A., and Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In LREC (2010), vol. 10, pp. 1320-1326.

[15] Shu, K., Mahudeswaran, D., Wang, S., Lee, D., and Liu, H. FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media. CoRR abs/1809.01286 (2018).