Good, Neutral or Bad - News Classification

Aashish Agarwal, Ankita Mandal, Matthias Schaffeld, Fangzheng Ji, Jihao Zhang, Yiqi Sun
University of Duisburg-Essen, Duisburg, Germany
{firstName.lastName}@stud.uni-due.de

Ahmet Aker
University of Duisburg-Essen, Duisburg, Germany
a.aker@is.inf.uni-due.de

Abstract

Reading news articles affects the mood and mindset of the reader. Therefore we want to provide means to track our daily news consumption. In this paper, we release a dataset of news articles labelled as good, bad or neutral. The dataset comprises 300 news articles, each annotated by five different annotators. The agreement among the annotators is 0.526 according to Krippendorff's Alpha and 0.435 according to Fleiss' Kappa. We also experiment with four different machine learning approaches: Naive Bayes, SVM, Logistic Regression and Deep Learning using LSTM units. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

Copyright (c) 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 Introduction

In the media, the presence of bad news seems to dominate over good news. Every day there is at least one report about terrorism, a natural or human-made disaster, a war crime, a human rights violation, an airplane crash, etc. Studies show that news in general has a significant impact on our mental state [8]. It has also been demonstrated that the influence of bad news is stronger than that of good news [13, 2] and that, due to the natural negativity bias described by [11], humans may end up consuming more bad than good news. This is a real threat to society: according to medical doctors and psychologists, exposure to bad news may have severe and long-lasting negative effects on our well-being and lead to stress, anxiety, and depression [8]. Furthermore, specific kinds of bad news, for example about unemployment, may affect stock markets and in turn the overall economy [4].

In our ever-digitized world, with a constant influx of news from a variety of sources, differentiating good and bad news may help the reader to combat this issue. A system that filters news based on the content of the article, no matter which news website a person is following, may enable the user to control the amount of bad news they are consuming. Since many people start their day by reading the news, such a system lets them start it on a positive note.

To implement such a news filtering system we created a gold standard dataset comprising 300 news articles annotated by five different raters with good, bad and neutral labels. This dataset will be made publicly accessible and can be used for further research.1

The definitions of good, bad and neutral news may vary widely from individual to individual and from country to country [7]. Therefore, we explicitly defined three categories of what can be termed good, bad or neutral news. To measure the quality of the ratings we used Fleiss' Kappa and Krippendorff's Alpha to check for inter-rater reliability. We also evaluated several machine learning techniques, including Naive Bayes, Logistic Regression, Support Vector Machines and Deep Learning, on the collected dataset. These four techniques give a first impression of the complexity of the task and serve as baselines for further improvement. Our initial results show that Naive Bayes significantly outperforms the other three approaches.

In Section 2 we define the terms good, bad and neutral news, and describe the process of corpus collection and the agreement on the ratings. Next, in Section 3, we describe our feature engineering and our baseline methods. In Section 4 we present our results. Finally, we conclude the paper in Section 5 with what can be done as future work.

1 https://github.com/ahmetaker/goodBadNews
2 Corpus

2.1 Definition of good, bad and neutral news

According to the Collins English Dictionary2, good news is "someone or something that is positive, encouraging, uplifting, desirable, or the like" and bad news is "someone or something regarded as undesirable". For neutral news, we stated that neither of these is the case. We used these definitions to run an initial annotation process on 20 randomly selected news articles. We asked five annotators, undergraduate students aged between 20 and 25 who are fluent in English and frequent online news readers, to read the articles and assign a good, bad or neutral label according to the above definitions. However, our annotators found these definitions too ambiguous, so we revisited the design of our guidelines and replaced them with exemplified definitions, briefly outlined in the following:

Good News: The subject of the article is someone being saved from danger, the creation of a medicine which can cure or help with an illness, the end of a war or some kind of disaster, human rights being defended, something that benefits the public, or a dangerous culprit being arrested.

Neutral News: The subject of the article is a popularization of science, history or geography; descriptions of humanistic traditions, astronomy, nature or landscape; scientific literature; news of people's livelihood without casualties; or daily entertainment and fashion news.

Bad News: The subject of the article is a war, an accident, a disaster, an epidemic disease, a killing, criminal activities, the death of a famous or important person, some sort of discrimination, bullying or stereotyping, or some negative influence or event regarding economics, nature, animals or human rights.

Using these exemplified definitions we re-ran the annotation process with another 20 randomly selected articles. This resulted in more satisfactory annotations, so we used this strategy to create our corpus.

2.2 Corpus Collection

Using Newspaper3k3, we randomly collected a corpus of 300 English news articles4. The articles come from different news agencies such as BBC.co.uk and independent.co.uk and cover topics from categories such as economic, medical, international, local and emergent news. The same five undergraduate students as above annotated these articles as good, bad or neutral news using the exemplified definitions given above. After gathering the annotations for all news articles, we took the majority of the annotators' opinions as the final label. If there was no clear majority vote, a meta-reviewer who was not among the five annotators gave a final decision. Table 1 gives some statistics about the corpus as well as the distribution of the different classes.

Table 1: Statistics about the corpus
Number of articles:      300
Average sentence count:  24.23
Average word count:      497.83
Number of good news:     52
Number of bad news:      131
Number of neutral news:  117

We also computed the agreement among the annotators using Fleiss' Kappa and Krippendorff's Alpha. Table 2 shows the results for inter-rater agreement. From the table, we can see that the agreement is moderate, indicating the difficulty of the task.

Table 2: Inter-rater agreement
Fleiss' Kappa:          0.435
Krippendorff's Alpha:   0.526

2 https://www.collinsdictionary.com/
3 https://pypi.org/project/newspaper3k/
4 These 300 articles are disjoint from the 40 articles used to refine the annotation definitions.
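For reproducibility, the following is a minimal sketch of how both agreement coefficients can be computed from the raw ratings, using the statsmodels and krippendorff Python packages. The assumption that the annotations are available as a 300 x 5 matrix of category codes is ours, not the released data format.

```python
# Minimal sketch: inter-rater agreement over 5 raters and 3 categories.
# Assumes `ratings` is a (n_articles, n_raters) array coded
# 0 = good, 1 = neutral, 2 = bad; this layout is an assumption.
import numpy as np
import krippendorff  # pip install krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(300, 5))  # placeholder for real annotations

# Fleiss' Kappa expects per-item category counts (items x categories).
counts, _ = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(counts, method="fleiss"))

# Krippendorff's Alpha expects a (raters x items) reliability matrix.
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T,
                         level_of_measurement="nominal"))
```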
3 Experiment

The task of good, bad or neutral news classification is to assign a given online news article to one of these classes. To find a classifier suited for this task, we explore several traditional machine learning approaches as well as deep learning. In both cases, we only use the article content to extract features. More precisely, for the traditional machine learning techniques we use Bag of Words (outlined in the next section), and for deep learning the lead part of each article represented with word embeddings.

3.1 Feature Engineering

For the traditional machine learning approaches, we use Bag of Words (BoW) as the only feature category. In total, our vocabulary contains 19,000 tokens including stop words, digits, inflected forms of words, etc. We use the following pre-processing steps to reduce the vocabulary size to 13,000 words:

• Lower-casing the article texts.
• Removing stop words.
• Removing digits and punctuation marks.
• Removing contractions.
• Depicting all numbers as #.
• Lemmatizing the words.

Each word is represented using its term frequency (TF, the number of times the word occurs in a particular news article) and its inverse document frequency (IDF, based on the number of articles in the corpus the word appears in). We further reduce the vocabulary size by keeping only the significant words: using a Chi-square test, we select those words that are significant in discriminating the classes. After this step, the vocabulary contains around 3,600 words. We represent these words by their TF*IDF weights as input to our traditional machine learning approaches.

For the deep learning technique, we use the lead part of each article, convert each word in this part into a word embedding and use these embeddings to represent the article.
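The BoW pipeline described above maps directly onto standard scikit-learn components. The following is a minimal sketch, assuming the pre-processed article texts and their labels are held in Python lists; the lemmatization, contraction removal and number masking steps are omitted, and the exact vectorizer settings are our assumptions rather than the authors' configuration.

```python
# Sketch of the BoW features from Section 3.1: TF*IDF weighting
# followed by Chi-square feature selection. `texts` and `labels`
# stand in for the cleaned article bodies and their gold labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["...", "..."]        # placeholder: pre-processed article texts
labels = ["bad", "good"]      # placeholder: gold labels from the corpus

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

# Keep roughly the 3,600 terms most associated with the classes.
selector = SelectKBest(chi2, k=min(3600, X.shape[1]))
X_selected = selector.fit_transform(X, labels)
```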
3.2 Baselines

As baselines, we experiment with a Naive Bayes classifier, Support Vector Machines, Multinomial Logistic Regression and a deep learning model using LSTMs.

Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [10]. It uses a probabilistic model of text. The Naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) of a learning problem [12]. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. Determined by grid search, we set alpha to 0.01.

Logistic Regression is one of the most popular supervised classification algorithms. Multinomial Logistic Regression generalizes it to the case where the dependent variable is nominal with more than two levels: it predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. Using grid search, we set C to 50 and use l2 regularization.

The SVM problem is to find the decision hyperplane that maximizes the margin between the data points of the classes [5]. Following our grid-search analysis, we use a linear kernel and set C to 10.
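As an illustration, the three traditional baselines with the grid-searched hyperparameters reported above can be instantiated in scikit-learn as follows. This is a sketch, not the authors' original code; it assumes `X_selected` and `labels` come from the feature pipeline sketched in Section 3.1.

```python
# Sketch of the three traditional baselines with the hyperparameters
# reported in Section 3.2 (alpha=0.01; C=50 with l2; linear kernel, C=10).
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    "NaiveBayes": MultinomialNB(alpha=0.01),
    "LogReg": LogisticRegression(C=50, penalty="l2", max_iter=1000),
    "SVM": SVC(kernel="linear", C=10),
}

for name, model in models.items():
    # 10-fold cross-validation with macro-averaged F1, as in Section 4.
    scores = cross_val_score(model, X_selected, labels, cv=10,
                             scoring="f1_macro")
    print(name, scores.mean())
```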
Our deep learning model comprises a single LSTM layer [1] that is capable of considering sequential information. The input to the LSTM layer (50 LSTM units) is word embeddings obtained from the input documents. Note that, as stated above, instead of using the entire article as input we use only the lead part of each article, which can be considered a summary of the news article [14]. For simplicity, and also to have a common input length across all articles, we use the first 400 words of each article as its lead part. We apply a Dropout layer after the LSTM (0.1), followed by a dense layer (50 units with ReLU activation), another Dropout layer (0.35) and finally a Softmax layer. We use Adam as the optimizer with a 0.001 learning rate and Xavier initialization for the weights. The loss is categorical cross-entropy together with l2 regularization. Our batch size is 64, and the number of epochs is set to 40.
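A minimal Keras sketch of this architecture follows. The vocabulary size, embedding dimension and the use of a trainable embedding layer are assumptions on our part (the paper does not specify how the word embeddings are produced); the layer sizes, dropout rates, optimizer, loss and training settings follow the description above.

```python
# Sketch of the LSTM baseline from Section 3.2 in Keras.
# VOCAB_SIZE and EMBED_DIM are illustrative assumptions; the sequence
# length (400) and remaining hyperparameters follow the paper.
from tensorflow.keras import Sequential, layers, optimizers, regularizers

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 20000, 100, 400  # assumed embedding setup

model = Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=SEQ_LEN),
    layers.LSTM(50),                                    # 50 LSTM units
    layers.Dropout(0.1),
    layers.Dense(50, activation="relu",
                 kernel_initializer="glorot_uniform",   # Xavier init
                 kernel_regularizer=regularizers.l2()), # l2 on the loss
    layers.Dropout(0.35),
    layers.Dense(3, activation="softmax"),              # good/neutral/bad
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=64, epochs=40)
```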
4 Results

The performance of the different classifiers is presented in Table 3. In all cases, we used 10-fold cross-validation and report macro-averaged precision, recall and F1 measure. From the results, we see that the best performing classifier is Naive Bayes, outperforming all the other classifiers. A significance test using a paired t-test with Bonferroni correction (p < 0.0125) [3] confirms that the Naive Bayes classifier significantly outperforms the other classifiers.

Table 3: Overall classifier performance comparison
Classifier   Accuracy  Precision  Recall  F1
NaiveBayes   0.829     0.828      0.796   0.799
SVM          0.717     0.517      0.583   0.533
LogReg       0.700     0.475      0.565   0.511
LSTM         0.594     0.415      0.478   0.533

5 Conclusion and Future Work

In this paper, we release a dataset containing news articles annotated with good, bad and neutral labels. We have a total of 300 news articles in our dataset, each annotated by five different annotators. We computed the inter-rater agreement using Krippendorff's Alpha and Fleiss' Kappa: the agreement is 0.526 according to Krippendorff's Alpha and 0.435 according to Fleiss' Kappa. We also experimented with four different machine learning approaches, namely Naive Bayes, SVM, Logistic Regression and Deep Learning using an LSTM, to provide initial results on the task. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

In the future, we plan to extend the dataset. This would allow the approaches to gain more stability, especially the deep learning strategies, whose performance relies on larger training data. We also plan to investigate features beyond Bag of Words to capture sentiments, emotions and similar linguistic aspects that better distinguish between bad and good news.

6 Application

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse and reliable information, it is necessary to have access to additional information with which to evaluate the news articles available to them. For this purpose, Fuhr et al. [6] propose to label every online news article with information nutrition labels that describe the ingredients of the article and thus give the reader a chance to evaluate what she is reading. This concept is analogous to food packages, where nutrition labels help buyers in their decision making. The authors discuss nine different information nutrition labels, including sentiment, subjectivity, objectivity, ease of reading, etc. We propose the good/bad/neutral classification as an additional information nutrition label and plan to implement it in our freely available News-Scan5 tool [9]. This tool is a browser plugin that users can invoke to obtain nutrition labels for the articles they are currently reading.

5 www.news-scan.com

7 Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

[1] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[2] Baumeister, R. F., Bratslavsky, E., Finkenauer, C., and Vohs, K. D. Bad is stronger than good. Review of General Psychology 5, 4 (2001), 323-370.
[3] Bland, J. M., and Altman, D. G. Multiple significance tests: the Bonferroni method. BMJ 310, 6973 (1995), 170.
[4] Boyd, J. H., Hu, J., and Jagannathan, R. The stock market's reaction to unemployment news: Why bad news is usually good for stocks. Journal of Finance 60, 2 (2005), 649-672.
[5] Colas, F., and Brazdil, P. Comparison of SVM and some older classification algorithms in text classification tasks. In: Bramer, M. (ed.) Artificial Intelligence in Theory and Practice. IFIP International Federation for Information Processing 217 (2006).
[6] Fuhr, N., Nejdl, W., Peters, I., Stein, B., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., and Mothe, J. An information nutritional label for online documents. ACM SIGIR Forum 51, 3 (Feb. 2018), 46-66.
[7] Giner, B., and Rees, W. On the asymmetric recognition of good and bad news in France, Germany and the United Kingdom. Journal of Business Finance & Accounting 28, 9-10, 1285-1331.
[8] Johnston, W. M., and Davey, G. C. L. The psychological impact of negative TV news bulletins: The catastrophizing of personal worries. British Journal of Psychology 88, 1 (1997), 85-91.
[9] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28-33.
[10] Kim, S. B., Rim, H. C., Yook, D. S., and Lim, H. S. Effective methods for improving naive Bayes text classifiers. LNAI 2417 (2002), 414-423.
[11] Rozin, P., and Royzman, E. B. Negativity bias, negativity dominance, and contagion. Personality and Social Psychology Review 5, 4 (2001), 296-320.
[12] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice-Hall, 2003.
[13] Soroka, S. N. Good news and bad news: Asymmetric responses to economic information. The Journal of Politics 68, 2 (2006), 372-385.
[14] Wasson, M. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics (1998), vol. 2.