CEUR Workshop Proceedings Vol-2411, paper 2: https://ceur-ws.org/Vol-2411/paper2.pdf
               Good, Neutral or Bad - News Classification

    Aashish Agarwal, Ankita Mandal, Matthias Schaffeld, Fangzheng Ji, Jihao Zhang, Yiqi Sun
                                  University of Duisburg-Essen
                                      Duisburg, Germany
                            {firstName.lastName}@stud.uni-due.de

                                                    Ahmet Aker
                                            University of Duisburg-Essen
                                                Duisburg, Germany
                                              a.aker@is.inf.uni-due.de



                        Abstract

Reading news articles affects the mood and mindset of the reader. Therefore we want to provide means to track our daily news consumption activities. In this paper, we release a news articles dataset assigned with good, bad and neutral labels. The dataset comprises 300 news articles, each annotated by five different annotators. The agreement among the annotators is 0.526 according to Krippendorff's Alpha and 0.435 according to Fleiss' Kappa. We also experiment with four different machine learning approaches: Naive Bayes, SVM, Logistic Regression and Deep Learning using LSTM units. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

Copyright (c) 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1   Introduction

In the media, the presence of bad news seems to dominate over good news. Every day there is at least one report about terrorism, a natural or human-made disaster, a war crime, a human rights violation, an airplane crash, etc. Studies show that news, in general, has a significant impact on our mental state [8]. It has also been demonstrated that the influence of bad news is more significant than that of good news [13, 2] and that, due to the natural negativity bias described by [11], humans may end up consuming more bad than good news. This is a real threat to society: according to medical doctors and psychologists, exposure to bad news may have severe and long-lasting negative effects on our well-being and lead to stress, anxiety and depression [8]. Furthermore, specific kinds of bad news, for example about unemployment, may affect stock markets and, in turn, the overall economy [4].

In our ever-digitized world, with a constant influx of news from a variety of sources, differentiating good and bad news may help the reader to combat this issue. A system that filters news based on the content of the article, no matter which news website a person is following, may enable the user to control the amount of bad news they are consuming. Since most people start their day by reading the news, they can then start it on a positive note.

To implement such a news filtering system we created a gold standard dataset comprising 300 news articles annotated by five different raters with good, bad and neutral labels. This dataset will be made publicly accessible and can be used for further research (https://github.com/ahmetaker/goodBadNews).

The definitions of good, bad and neutral news may vary widely from individual to individual and from country to country [7]. Therefore, we defined the three categories explicitly, stating what can be termed good, bad or neutral news. To measure the quality of the ratings we used Fleiss' Kappa and Krippendorff's Alpha to check for inter-rater reliability. We also evaluated several machine learning techniques, including Naive Bayes, Logistic Regression, Support Vector Machines and Deep Learning, on the collected dataset. These four techniques should give a first impression of the complexity of the task and serve as baselines for further improving the results. Our initial results show that Naive Bayes significantly outperforms the other three approaches.

In Section 2 of the paper, we define the terms good, bad and neutral news. We also describe the process of corpus collection and agreement on ratings. Next, in Section 3, we describe our methods of feature engineering and our baseline methods. In Section 4 we present our results. Finally, we conclude the paper in Section 5 with what can be done as future work.

    Number of Articles          300
    Average Sentences Count     24.23
    Average Word Count          497.83
    Number of good news         52
    Number of bad news          131
    Number of neutral news      117

          Table 1: Statistics about the corpus

    Fleiss' Kappa               0.435
    Krippendorff's Alpha        0.526

          Table 2: Inter-rater agreement
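The Fleiss' Kappa figure in Table 2 can be recomputed from the raw annotations with a short script. The sketch below is a minimal, self-contained implementation for the paper's setting (five raters, three categories); the function name and the toy count matrix are our own illustration, not part of the released dataset:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n (here n = 5).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Mean per-item observed agreement P_i
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 articles, 5 raters, categories (good, neutral, bad)
ratings = [[5, 0, 0], [0, 5, 0], [3, 2, 0], [2, 2, 1]]
print(round(fleiss_kappa(ratings), 3))  # → 0.358
```

Krippendorff's Alpha follows the same observed-versus-expected-disagreement idea but additionally handles missing ratings, which is why the two coefficients differ slightly on the same data.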
2     Corpus

2.1    Definition of good, bad and neutral news

According to the Collins English Dictionary (https://www.collinsdictionary.com/), good news is defined as "someone or something that is positive, encouraging, uplifting, desirable, or the like" and bad news as "someone or something regarded as undesirable". For neutral news, we stated that neither of these is the case. We used these definitions to start our annotation. With these definitions, we ran an initial annotation process with 20 randomly selected news articles. We asked 5 annotators, undergraduate students aged 20-25 years, fluent in English and frequent online news readers, to read the news and provide a good, bad or neutral label according to the above definitions. However, our annotators found these definitions too ambiguous, so we revisited the design of our guidelines. This included using exemplified definitions instead. In the following we briefly outline these exemplified definitions:

   Good News If the subject of the article is someone being saved from danger, the creation of medicine which can cure or help with an illness, the end of a war or some kind of disaster, human rights being defended, something that benefits the public, or a dangerous culprit being arrested.

   Neutral News If the subject of the article is a popularization of science, history or geography, describing humanistic traditions, astronomy, nature, history or landscape, scientific literature, news of people's livelihood without casualties, or daily entertainment and fashion news.

   Bad News If the subject of the article is a war, accidents, disaster, epidemic disease or killing, criminal activities, the death of a famous or important person, some sort of discrimination, bullying or stereotypes, or some negative influence or event regarding economics, nature, animals or human rights.

   Using these exemplified definitions we re-ran the annotation process with another 20 randomly selected articles. This resulted in more satisfactory annotations, so we used this strategy to create our corpus.

2.2    Corpus Collection

Using Newspaper3k (https://pypi.org/project/newspaper3k/), we randomly collected a corpus of 300 English news articles; these 300 articles are distinct from the 40 articles used to refine the annotation definitions. The articles come from different news agencies such as BBC.co.uk and independent.co.uk and cover topics from categories such as economic, medical, international, local and emergent news. We used the exemplified definitions given above to annotate these as good, bad or neutral news. The same five undergraduate students as above took part in the annotation task. After gathering the annotations for all news articles, we took the majority of the readers' opinions as the final label. If no clear majority vote was found, we introduced a meta reviewer, who was not among the five annotators, to give a final decision. Table 1 gives some statistics about the corpus as well as the distribution of the different classes.

   We also computed the agreement among the annotators. To do this, we used Fleiss' Kappa and Krippendorff's Alpha. Table 2 shows the results for inter-rater agreement. From the table, we can see that the agreement is moderate, indicating the difficulty of the task.

3     Experiment

The task of good, bad or neutral news classification is to classify a given online news article into one of those classes. To find a classifier suited for this task, we explore different traditional machine learning approaches as well as deep learning. In both cases, we only use the article content to extract features. More precisely, for the traditional machine learning techniques we use Bag of Words (outlined in the next section) and for deep learning the lead part of each article represented with word embeddings.
3.1    Feature Engineering

For the traditional machine learning approaches, we use Bag of Words (BoW) as the only feature category. In total, our vocabulary contains 19000 tokens including stop words, digits, inflected forms of words, etc. We use the following pre-processing steps to reduce the vocabulary size to 13000 words:

  • Lower-casing the article texts.

  • Removing stop words.

  • Removing digits and punctuation marks.

  • Removing contractions.

  • Depicting all numbers as #.

  • Lemmatizing the words.

   Each word is represented using term frequency (TF), the number of times the word occurs in a particular news article, and inverse document frequency (IDF), based on the number of articles from the corpus the word appears in. We further reduce the vocabulary size by only using the significant words. For this, we use the Chi-square test and select those words that are significant in discriminating the classes. After this step, the vocabulary contains around 3600 words. We use these words, represented using TF*IDF, to guide our traditional machine learning approaches.

   For the deep learning technique, we use the lead part of each article, convert each word in this part into word embeddings and use these to represent each article.

3.2    Baselines

As baselines, we experiment with a Naive Bayes classifier, Support Vector Machines, Multinomial Logistic Regression and a deep learning model using LSTMs.

   Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness [10]. It uses a probabilistic model of text. The Naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem [12]. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. Determined by grid search, we set alpha to 0.01.

   Logistic Regression is one of the most popular supervised classification algorithms. Multinomial Logistic Regression is the generalization of the Logistic Regression algorithm that can be used when the dependent variable is nominal with more than two levels. It is a model used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. Using grid search, we set C to 50 and use l2 regularization.

   The SVM problem is to find the decision hyperplane that maximizes the margin between the data points of the classes [5]. Corresponding to our grid-search analysis, we use a linear kernel and set C to 10.

   Our deep learning model comprises a simple LSTM layer [1] that is capable of considering sequential information. The input of the LSTM layer (50 units) is word embeddings, which we obtain from the input documents. Note that, as stated above, instead of using the entire article as input we use only the lead part of each article, which can be considered a summary of the news article [14]. For simplicity, and to have a common input length across all articles, we use the first 400 words of each article as its lead part. We use a Dropout layer after the LSTM (0.1), which is followed by a dense layer (50 units with ReLU activation), then again by a Dropout layer (0.35) and finally by a SoftMax layer. We use Adam as the optimization function with a 0.001 learning rate and Xavier initialization for weight initialization. The loss is determined by categorical cross-entropy together with l2 regularization. Our batch size is 64, and the number of epochs is set to 40.

4   Results

The results of the different classifiers are presented in Table 3. In all cases, we used 10-fold cross-validation and report macro-averaged F1 measure, precision and recall. From the results, we see that the best performing classifier is Naive Bayes, outperforming all the other classifiers. A significance test using the paired t-test with Bonferroni correction (p < 0.0125) [3] shows that the Naive Bayes classifier significantly outperforms the other classifiers.

    Classifier     Accuracy   Precision   Recall   F1
    Naive Bayes    0.829      0.828       0.796    0.799
    SVM            0.717      0.517       0.583    0.533
    LogReg         0.700      0.475       0.565    0.511
    LSTM           0.594      0.415       0.478    0.533

    Table 3: Overall Classifier Performance Comparison

5   Conclusion and Future Work

In this paper, we release a dataset containing news articles annotated with good, bad and neutral labels. We have a total of 300 news articles in our dataset, where each article has been annotated by five different annotators. We computed the inter-rater agreement using Krippendorff's Alpha and Fleiss' Kappa. According to Krippendorff's Alpha, the agreement is 0.526 and according to Fleiss' Kappa 0.435. We also experiment with four different machine learning approaches, namely Naive Bayes, SVM, Logistic Regression and Deep Learning using LSTMs, to provide initial results on the task. Our experiments show that Naive Bayes significantly outperforms the other three classifiers.

   In the future, we plan to extend the dataset. This would allow the approaches to gain more stability, especially the deep learning strategies, whose performance relies on bigger training data. We also plan to investigate features other than Bag of Words to capture sentiments, emotions and similar linguistic aspects that better distinguish between bad and good news.

6   Application

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse and reliable information, it is necessary to have access to additional information to evaluate the news articles available to them. For this purpose, Fuhr et al. [6] propose to label every online news article with information nutrition labels to describe the ingredients of the article and thus give the reader a chance to evaluate what she is reading. This concept is analogous to food packages, where nutrition labels help buyers in their decision making. The authors discuss 9 different information nutrition labels including sentiment, subjectivity, objectivity, ease of reading, etc. We propose the bad/good/neutral classification as an additional information nutrition label and plan to implement this in our freely available News-Scan tool (www.news-scan.com) [9]. This tool is a browser plugin that can be invoked by users to obtain nutrition labels for the articles they are currently reading.

7   Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - GRK 2167, Research Training Group "User-Centred Social Media".

References

 [1] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

 [2] Baumeister, R. F., Bratslavsky, E., Finkenauer, C., and Vohs, K. D. Bad is stronger than good. Review of General Psychology 5, 4 (2001), 323–370.

 [3] Bland, J. M., and Altman, D. G. Multiple significance tests: the Bonferroni method. BMJ 310, 6973 (1995), 170.

 [4] Boyd, J. H., Hu, J., and Jagannathan, R. The stock market's reaction to unemployment news: Why bad news is usually good for stocks. Journal of Finance 60, 2 (2005), 649–672.

 [5] Colas, F., and Brazdil, P. Comparison of SVM and some older classification algorithms in text classification tasks. In: Bramer, M. (ed.) Artificial Intelligence in Theory and Practice. IFIP International Federation for Information Processing 217 (2006).

 [6] Fuhr, N., Nejdl, W., Peters, I., Stein, B., Giachanou, A., Grefenstette, G., Gurevych, I., Hanselowski, A., Jarvelin, K., Jones, R., Liu, Y., and Mothe, J. An information nutritional label for online documents. ACM SIGIR Forum 51, 3 (Feb. 2018), 46–66.

 [7] Giner, B., and Rees, W. On the asymmetric recognition of good and bad news in France, Germany and the United Kingdom. Journal of Business Finance & Accounting 28, 9-10, 1285–1331.

 [8] Johnston, W. M., and Davey, G. C. L. The psychological impact of negative TV news bulletins: The catastrophizing of personal worries. British Journal of Psychology 88, 1 (1997), 85–91.

 [9] Kevin, V., Högden, B., Schwenger, C., Sahan, A., Madan, N., Aggarwal, P., Bangaru, A., Muradov, F., and Aker, A. Information nutrition labels: A plugin for online news evaluation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (Brussels, Belgium, Nov. 2018), Association for Computational Linguistics, pp. 28–33.
[10] Kim, S. B., Rim, H. C., Yook, D. S., and Lim, H. S. Effective methods for improving naive Bayes text classifiers. LNAI 2417 (2002), 414–423.

[11] Rozin, P., and Royzman, E. B. Negativity
     bias, negativity dominance, and contagion. Per-
     sonality and Social Psychology Review 5, 4 (2001),
     296–320.

[12] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall, 2003.
[13] Soroka, S. N. Good news and bad news: Asym-
     metric responses to economic information. The
     Journal of Politics 68, 2 (2006), 372–385.
[14] Wasson, M. Using leading text for news sum-
     maries: Evaluation results and implications for
     commercial summarization applications. In COL-
     ING 1998 Volume 2: The 17th International
     Conference on Computational Linguistics (1998),
     vol. 2.