Good, Neutral or Bad - News Classification

Aashish Agarwal, Ankita Mandal, Matthias Schaffeld, Fangzheng Ji, Jihao Zhang, Yiqi Sun
University of Duisburg-Essen, Duisburg, Germany
{firstName.lastName}@stud.uni-due.de

Ahmet Aker
University of Duisburg-Essen, Duisburg, Germany
a.aker@is.inf.uni-due.de

Abstract

Reading news articles affects the mood and mindset of the reader. Therefore we want to provide means to track our daily news consumption. In this paper, we release a dataset of news articles labelled as good, bad or neutral. The dataset comprises 300 news articles, each annotated by five different annotators. The agreement among the annotators is 0.526 according to Krippendorff's Alpha and 0.435 according to Fleiss' Kappa. We also experiment with four different machine learning approaches: Naive Bayes, SVM, Logistic Regression and Deep Learning using LSTM units. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

Copyright (c) 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 Introduction

In the media, the presence of bad news seems to dominate over good news. Every day there is at least one report about terrorism, a natural or human-made disaster, a war crime, a human rights violation, an airplane crash, etc. Studies show that news in general has a significant impact on our mental state [8]. It has also been demonstrated that the influence of bad news is stronger than that of good news [13, 2] and that, due to the natural negativity bias described by [11], humans may end up consuming more bad than good news. This is a real threat to society: according to medical doctors and psychologists, exposure to bad news may have severe and long-lasting negative effects on our well-being and lead to stress, anxiety, and depression [8]. Furthermore, specific kinds of bad news, for example about unemployment, may affect stock markets and in turn the overall economy [4].

In our ever-digitized world, with a constant influx of news from a variety of sources, differentiating good and bad news may help the reader to combat this issue. A system that filters news based on the content of the article, no matter which news website a person is following, may enable the user to control the amount of bad news they are consuming. Since many people start their day by reading the news, such a system lets them start it on a positive note.

To implement such a news filtering system we created a gold standard dataset comprising 300 news articles annotated by five different raters with good, bad and neutral labels. This dataset will be made publicly accessible and can be used for further research.1

The definitions of good, bad and neutral news may vary widely from individual to individual and from country to country [7]. Therefore, we explicitly defined three categories of what can be termed good, bad or neutral news. To measure the quality of the ratings we used Fleiss' Kappa and Krippendorff's Alpha to check for inter-rater reliability. We also evaluated several machine learning techniques, including Naive Bayes, Logistic Regression, Support Vector Machines and Deep Learning, on the collected dataset. These four techniques give a first impression of the complexity of the task and serve as baselines for further improvement. Our initial results show that Naive Bayes significantly outperforms the other three approaches.

In Section 2 we define the terms good, bad and neutral news, and describe the process of corpus collection and the agreement on the ratings. Next, in Section 3, we describe our feature engineering and our baseline methods. In Section 4 we present our results. Finally, we conclude the paper in Section 5 with what can be done as future work.

1 https://github.com/ahmetaker/goodBadNews
2 Corpus

2.1 Definition of good, bad and neutral news

According to the Collins English Dictionary2, good news is "someone or something that is positive, encouraging, uplifting, desirable, or the like" and bad news is "someone or something regarded as undesirable". For neutral news, we stated that neither of these is the case. We used these definitions to run an initial annotation process on 20 randomly selected news articles. We asked five annotators, undergraduate students aged between 20 and 25 who are fluent in English and frequent online news readers, to read the articles and assign a good, bad or neutral label according to the above definitions. However, our annotators found these definitions too ambiguous, so we revisited the design of our guidelines and replaced them with exemplified definitions, briefly outlined in the following:

Good News: The subject of the article is someone being saved from danger, the creation of a medicine which can cure or help with an illness, the end of a war or some kind of disaster, human rights being defended, something that benefits the public, or a dangerous culprit being arrested.

Neutral News: The subject of the article is a popularization of science, history or geography; descriptions of humanistic traditions, astronomy, nature or landscape; scientific literature; news of people's livelihood without casualties; or daily entertainment and fashion news.

Bad News: The subject of the article is a war, an accident, a disaster, an epidemic disease, a killing, criminal activities, the death of a famous or important person, some sort of discrimination, bullying or stereotyping, or some negative influence or event regarding economics, nature, animals or human rights.

Using these exemplified definitions we re-ran the annotation process with another 20 randomly selected articles. This resulted in more satisfactory annotations, so we used this strategy to create our corpus.

2.2 Corpus Collection

Using Newspaper3k3, we randomly collected a corpus of 300 English news articles4. The articles come from different news agencies such as BBC.co.uk and independent.co.uk and cover topics from categories such as economic, medical, international, local and emergent news. The same five undergraduate students as above annotated these articles as good, bad or neutral news using the exemplified definitions given above. After gathering the annotations for all news articles, we took the majority of the annotators' opinions as the final label. If there was no clear majority vote, a meta-reviewer who was not among the five annotators gave a final decision. Table 1 gives some statistics about the corpus as well as the distribution of the different classes.

Table 1: Statistics about the corpus
Number of articles:      300
Average sentence count:  24.23
Average word count:      497.83
Number of good news:     52
Number of bad news:      131
Number of neutral news:  117

We also computed the agreement among the annotators using Fleiss' Kappa and Krippendorff's Alpha. Table 2 shows the results for inter-rater agreement. From the table, we can see that the agreement is moderate, indicating the difficulty of the task.

Table 2: Inter-rater agreement
Fleiss' Kappa:          0.435
Krippendorff's Alpha:   0.526

2 https://www.collinsdictionary.com/
3 https://pypi.org/project/newspaper3k/
4 These 300 articles are disjoint from the 40 articles used to refine the annotation definitions.
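For reproducibility, the following is a minimal sketch of how both agreement coefficients can be computed from the raw ratings, using the statsmodels and krippendorff Python packages. The assumption that the annotations are available as a 300 x 5 matrix of category codes is ours, not the released data format.

```python
# Minimal sketch: inter-rater agreement over 5 raters and 3 categories.
# Assumes `ratings` is a (n_articles, n_raters) array coded
# 0 = good, 1 = neutral, 2 = bad; this layout is an assumption.
import numpy as np
import krippendorff  # pip install krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(300, 5))  # placeholder for real annotations

# Fleiss' Kappa expects per-item category counts (items x categories).
counts, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(counts, method="fleiss"))

# Krippendorff's Alpha expects a (raters x items) reliability matrix.
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T,
                         level_of_measurement="nominal"))
```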
3 Experiment

The task of good, bad or neutral news classification is to assign a given online news article to one of these classes. To find a classifier suited for this task, we explore several traditional machine learning approaches as well as deep learning. In both cases, we only use the article content to extract features. More precisely, for the traditional machine learning techniques we use Bag of Words (outlined in the next section), and for deep learning the lead part of each article represented with word embeddings.

3.1 Feature Engineering

For the traditional machine learning approaches, we use Bag of Words (BoW) as the only feature category. In total, our vocabulary contains 19,000 tokens including stop words, digits, inflected forms of words, etc. We use the following pre-processing steps to reduce the vocabulary size to 13,000 words:

• Lower-casing the article texts.
• Removing stop words.
• Removing digits and punctuation marks.
• Removing contractions.
• Depicting all numbers as #.
• Lemmatizing the words.

Each word is represented using its term frequency (TF, the number of times the word occurs in a particular news article) and its inverse document frequency (IDF, based on the number of articles in the corpus the word appears in). We further reduce the vocabulary size by keeping only the significant words: using a Chi-square test, we select those words that are significant in discriminating the classes. After this step, the vocabulary contains around 3,600 words. We represent these words by their TF*IDF weights as input to our traditional machine learning approaches.

For the deep learning technique, we use the lead part of each article, convert each word in this part into a word embedding and use these embeddings to represent the article.
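The BoW pipeline described above maps directly onto standard scikit-learn components. The following is a minimal sketch, assuming the pre-processed article texts and their labels are held in Python lists; the lemmatization, contraction removal and number masking steps are omitted, and the exact vectorizer settings are our assumptions rather than the authors' configuration.

```python
# Sketch of the BoW features from Section 3.1: TF*IDF weighting
# followed by Chi-square feature selection. `texts` and `labels`
# stand in for the cleaned article bodies and their gold labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["...", "..."]        # placeholder: pre-processed article texts
labels = ["bad", "good"]      # placeholder: gold labels from the corpus

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

# Keep roughly the 3,600 terms most associated with the classes.
selector = SelectKBest(chi2, k=min(3600, X.shape[1]))
X_selected = selector.fit_transform(X, labels)
```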
3.2 Baselines

As baselines, we experiment with a Naive Bayes classifier, Support Vector Machines, Multinomial Logistic Regression and a deep learning model using LSTMs.

Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [10]. It uses a probabilistic model of text. The Naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) of a learning problem [12]. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. Determined by grid search, we set alpha to 0.01.

Logistic Regression is one of the most popular supervised classification algorithms. Multinomial Logistic Regression generalizes it to the case where the dependent variable is nominal with more than two levels: it predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. Using grid search, we set C to 50 and use l2 regularization.

The SVM problem is to find the decision hyperplane that maximizes the margin between the data points of the classes [5]. Following our grid-search analysis, we use a linear kernel and set C to 10.
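As an illustration, the three traditional baselines with the grid-searched hyperparameters reported above can be instantiated in scikit-learn as follows. This is a sketch, not the authors' original code; it assumes `X_selected` and `labels` come from the feature pipeline sketched in Section 3.1.

```python
# Sketch of the three traditional baselines with the hyperparameters
# reported in Section 3.2 (alpha=0.01; C=50 with l2; linear kernel, C=10).
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    "NaiveBayes": MultinomialNB(alpha=0.01),
    "LogReg": LogisticRegression(C=50, penalty="l2", max_iter=1000),
    "SVM": SVC(kernel="linear", C=10),
}

for name, model in models.items():
    # 10-fold cross-validation with macro-averaged F1, as in Section 4.
    scores = cross_val_score(model, X_selected, labels, cv=10,
                             scoring="f1_macro")
    print(name, scores.mean())
```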
Our deep learning model comprises a single LSTM layer [1] that is capable of considering sequential information. The input to the LSTM layer (50 LSTM units) is word embeddings obtained from the input documents. Note that, as stated above, instead of using the entire article as input we use only the lead part of each article, which can be considered a summary of the news article [14]. For simplicity, and also to have a common input length across all articles, we use the first 400 words of each article as its lead part. We apply a Dropout layer after the LSTM (0.1), followed by a dense layer (50 units with ReLU activation), another Dropout layer (0.35) and finally a Softmax layer. We use Adam as the optimizer with a 0.001 learning rate and Xavier initialization for the weights. The loss is categorical cross-entropy together with l2 regularization. Our batch size is 64, and the number of epochs is set to 40.
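A minimal Keras sketch of this architecture follows. The vocabulary size, embedding dimension and the use of a trainable embedding layer are assumptions on our part (the paper does not specify how the word embeddings are produced); the layer sizes, dropout rates, optimizer, loss and training settings follow the description above.

```python
# Sketch of the LSTM baseline from Section 3.2 in Keras.
# VOCAB_SIZE and EMBED_DIM are illustrative assumptions; the sequence
# length (400) and remaining hyperparameters follow the paper.
from tensorflow.keras import Sequential, layers, optimizers, regularizers

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 20000, 100, 400  # assumed embedding setup

model = Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=SEQ_LEN),
    layers.LSTM(50),                                    # 50 LSTM units
    layers.Dropout(0.1),
    layers.Dense(50, activation="relu",
                 kernel_initializer="glorot_uniform",   # Xavier init
                 kernel_regularizer=regularizers.l2()), # l2 on the loss
    layers.Dropout(0.35),
    layers.Dense(3, activation="softmax"),              # good/neutral/bad
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=64, epochs=40)
```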
4 Results

The performance of the different classifiers is presented in Table 3. In all cases, we used 10-fold cross-validation and report macro-averaged precision, recall and F1 measure. From the results, we see that the best performing classifier is Naive Bayes, outperforming all the other classifiers. A significance test using a paired t-test with Bonferroni correction (p < 0.0125) [3] confirms that the Naive Bayes classifier significantly outperforms the other classifiers.

Table 3: Overall classifier performance comparison
Classifier   Accuracy  Precision  Recall  F1
NaiveBayes   0.829     0.828      0.796   0.799
SVM          0.717     0.517      0.583   0.533
LogReg       0.700     0.475      0.565   0.511
LSTM         0.594     0.415      0.478   0.533

5 Conclusion and Future Work

In this paper, we release a dataset containing news articles annotated with good, bad and neutral labels. We have a total of 300 news articles in our dataset, each annotated by five different annotators. We computed the inter-rater agreement using Krippendorff's Alpha and Fleiss' Kappa: the agreement is 0.526 according to Krippendorff's Alpha and 0.435 according to Fleiss' Kappa. We also experimented with four different machine learning approaches, namely Naive Bayes, SVM, Logistic Regression and Deep Learning using an LSTM, to provide initial results on the task. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

In the future, we plan to extend the dataset. This would allow the approaches to gain more stability, especially the deep learning strategies, whose performance relies on larger training data. We also plan to investigate features beyond Bag of Words to capture sentiments, emotions and similar linguistic aspects that better distinguish between bad and good news.

6 Application

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse and reliable information, it is necessary to have access to additional information with which to evaluate the news articles available to them. For this purpose, Fuhr et al. [6] propose to label every online news article with information nutrition labels that describe the ingredients of the article and thus give the reader a chance to evaluate what she is reading. This concept is analogous to food packages, where nutrition labels help buyers in their decision making. The authors discuss nine different information nutrition labels, including sentiment, subjectivity, objectivity, ease of reading, etc. We propose the good/bad/neutral classification as an additional information nutrition label and plan to implement it in our freely available News-Scan5 tool [9]. This tool is a browser plugin that users can invoke to obtain nutrition labels for the articles they are currently reading.

5 www.news-scan.com

7 Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

[1] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[2] Baumeister, R. F., Bratslavsky, E., Finkenauer, C., and Vohs, K. D. Bad is stronger than good. Review of General Psychology 5, 4 (2001), 323-370.
[3] Bland, J. M., and Altman, D. G. Multiple significance tests: the Bonferroni method. BMJ 310, 6973 (1995), 170.
[4] Boyd, J. H., Hu, J., and Jagannathan, R. The stock market's reaction to unemployment news: Why bad news is usually good for stocks. Journal of Finance 60, 2 (2005), 649-672.
[5] Colas, F., and Brazdil, P. Comparison of SVM and some older classification algorithms in text classification tasks. In: Bramer, M. (ed.) Artificial Intelligence in Theory and Practice. IFIP International Federation for Information Processing 217 (2006).
[6] Fuhr, N., Nejdl, W., Peters, I., Stein, B., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., and Mothe, J. An information nutritional label for online documents. ACM SIGIR Forum 51, 3 (Feb. 2018), 46-66.
[7] Giner, B., and Rees, W. On the asymmetric recognition of good and bad news in France, Germany and the United Kingdom. Journal of Business Finance & Accounting 28, 9-10, 1285-1331.
[8] Johnston, W. M., and Davey, G. C. L. The psychological impact of negative TV news bulletins: The catastrophizing of personal worries. British Journal of Psychology 88, 1 (1997), 85-91.
[9] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28-33.
[10] Kim, S. B., Rim, H. C., Yook, D. S., and Lim, H. S. Effective methods for improving naive Bayes text classifiers. LNAI 2417 (2002), 414-423.
[11] Rozin, P., and Royzman, E. B. Negativity bias, negativity dominance, and contagion. Personality and Social Psychology Review 5, 4 (2001), 296-320.
[12] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice-Hall, 2003.
[13] Soroka, S. N. Good news and bad news: Asymmetric responses to economic information. The Journal of Politics 68, 2 (2006), 372-385.
[14] Wasson, M. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics (1998), vol. 2.