Understanding Characteristics of Biased Sentences in News Articles

Sora Lim, Adam Jatowt, Masatoshi Yoshikawa
Kyoto University, Kyoto, Japan
lim.sora.88u@st.kyoto-u.ac.jp, adam@dl.kuis.kyoto-u.ac.jp, yoshikawa@i.kyoto-u.ac.jp

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Providing balanced and good-quality news articles to readers is an important challenge in news recommendation. Often, readers tend to select and read articles which confirm their social environment and their political beliefs. This issue is also known as the filter bubble. As a remedy, initial approaches towards automatically detecting bias in news articles have been developed. Obtaining a suitable ground truth for such a task is, however, difficult. In this paper, we describe a ground truth dataset created with the help of crowd-sourcing for fostering research on bias detection and removal from news content. We then analyze the characteristics of the user annotations, in particular concerning bias-inducing words. Our results indicate that determining bias-inducing words is subjective to a certain degree and that a high agreement among all readers on all bias-inducing words is hard to obtain. We also study the discriminative characteristics of biased content and find that linguistic features, such as negative words, tend to be indicative of bias.

1 Introduction

In news reporting it is important for both authors and readers to maintain high fairness and accuracy, and to keep a balance between different viewpoints. However, bias in news articles has become a major issue [GM05, Ben16], even though many news outlets claim to have dedicated policies to assure the objectiveness of their articles. Different news sources may have their own views on society, politics and other topics. Furthermore, they need to attract readers to make their businesses profitable. This frequently leads to a potentially harmful reporting style and results in biased news.

To overcome news bias, users often try to choose news articles from news sources (outlets) which are known to be relatively unbiased. Ideally, this selection should be performed by corresponding recommender systems. However, bias-free article recommendations are still not feasible given the state of the art. Furthermore, such recommendations might not be trusted by users, as readers often need concrete evidence of bias in the form of bias-inducing words and similar aspects.

In this paper, we focus on understanding news bias and on developing a high-quality gold standard for fostering bias-detection studies on the sentence and word levels. We assume here that word choices made by articles' authors might reflect some bias in terms of their viewpoint. For example, the phrases "illegal immigrants" and "undocumented immigrants", chosen by news reporters to refer to immigrants in relation to Donald Trump's decision to rescind Deferred Action for Childhood Arrivals, may be considered a case where the choice of words can result in a bias. Here, the use of the word "illegal" degrades the immigrants by carrying a more negative value than the adjective "undocumented". By such nuanced word choices, news authors may imply their stance on the news event and deliver a biased view to the readers.

It is, however, challenging to identify words that cause an article to have a biased point of view [BEQ+15]. The bias inherent in news articles tends to be subtle and intricate. In this research, we construct a comparable news dataset which consists of news articles reporting the same news event (the dataset is available at https://github.com/skymoonlight/newsdata-bias). The objective is to help design methods to detect bias triggers and to shed new light on the way in which users recognize bias in news articles. To the best of our knowledge, this is the first dataset with annotated bias words in news articles. In the following, we describe the design of the crowd-sourcing task to obtain the bias labels for the news words, and we subsequently analyze the characteristics of the detected biased content.
2 Related Works

Several prior works have focused on media bias in general and news bias in particular. According to D'Alessio and Allen [DA00], media bias can be divided into three types: (1) gatekeeping, (2) coverage and (3) statement bias. Gatekeeping bias is the selection of stories out of the potential stories; coverage bias expresses how much space specific positions receive in media; statement bias, in contrast, denotes how an author's own opinion is woven into a text. Similarly, Alsem et al. [ABHK08] divide news bias into ideology and spin. Ideology reflects a news outlet's desire to affect readers' opinions in a particular direction, while spin reflects the outlet's attempt to simply create a memorable story. Given these distinctions, we consider the bias type tackled in this paper as statement bias w.r.t. [DA00] and as spin bias according to [ABHK08].

Several studies have made efforts to provide effective means for addressing the news bias problem. However, most of them have focused on news diversification according to content similarity and the political stance of news outlets. Park et al. [PKCS09], for instance, developed a news diversification system, named NewsCube, to mitigate the bias problem by providing diverse information to users. Hamborg et al. [HMG17] presented a matrix-based news analysis to display various perspectives on the same news topic in a two-dimensional matrix. An et al. [ACG+12] revealed the skewness of news outlets by analyzing how their news contents spread through tweets.

Alonso et al. [ADS17] focused on omissions between news statements which are similar but not identical. Omission constitutes one category of news bias in that it is a means of statement bias [GS06]. Ogawa et al. [OMY11] attempted to describe the relationships between the main participants in news articles in order to detect news bias. To capture the way these relationships are described, they expanded sentiment words in SentiWordNet [BES10].

Other works have focused on linguistic analysis for bias detection in text data. Recasens et al. [RDJ13] targeted detecting bias words from the sentence revision history in Wikipedia. They utilized NPOV tags as bias labels, and linguistically categorized resources as bias features. Baumer et al. [BEQ+15] used Recasens et al.'s linguistic features, as well as features from the theoretical literature on framing, to identify biased language in political news.

3 Annotating Bias in News Articles

3.1 Dataset

To detect the subtle differences which cause bias, one way is to compare words across the content of different news articles which report the same news event. This should allow for pinpointing differences in the subtle use of words by different authors from diverse media outlets describing the same event. Although many news datasets have been created for news analysis, to the best of our knowledge, none has focused on a single event while, at the same time, covering many news articles from various news outlets within a short time range.

We selected the news event titled "Black men arrested in Starbucks", which caused controversial discussions on racism. The event happened on April 12, 2018. We focused on news articles written on April 15, 2018, as the event was widely reported in different news outlets on that day.

For collecting news articles from various news outlets we used Google News (https://news.google.com/?hl=en-US&gl=US&ceid=US:en). Google News is a convenient source for our case as it already clusters news articles concerning the same event coming from various sources. We first crawled all news articles available online that described the aforementioned event. Based on manual inspection, we then verified whether all articles were about the same news event. We next extracted the titles and text content from the crawled pages, ignoring pages which contained only pictures or only a single sentence. In the end, our dataset consists of 89 news articles with 1,235 sentences and 2,542 unique words from 83 news outlets. Articles contain on average 14 paragraphs.
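As an illustration of the collection step just described, the following is a minimal sketch, not the authors' original pipeline. It assumes a list of article URLs gathered beforehand from a Google News event cluster, extracts titles and body text with the newspaper3k library, and drops picture-only or single-sentence pages.

```python
# Sketch of the article-collection step described above (not the original pipeline).
# Assumes `urls` was obtained beforehand, e.g. from a Google News event cluster.
from newspaper import Article  # pip install newspaper3k
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer used for the length filter

def fetch_articles(urls):
    """Download pages and keep only those with a title and more than one sentence."""
    dataset = []
    for url in urls:
        try:
            article = Article(url)
            article.download()
            article.parse()
        except Exception:
            continue  # skip pages that fail to download or parse
        sentences = nltk.sent_tokenize(article.text)
        if article.title and len(sentences) > 1:  # drop picture-only / one-sentence pages
            dataset.append({"url": url, "title": article.title, "sentences": sentences})
    return dataset

# Example usage with a hypothetical URL:
# articles = fetch_articles(["https://example.com/starbucks-arrest-report"])
```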
3.2 Bias Labeling via Crowd-Sourcing

To overcome the scalability issue in annotation, crowd-sourcing has been widely used [FMK+10, ZLP+15]. We also use crowdsourcing to collect bias labels, and we chose Figure Eight (https://www.figure-eight.com/) as our platform. Figure Eight (called CrowdFlower until March 2018) has been used in a variety of annotation tasks and is especially suitable for our purposes due to its focus on producing high-quality annotations. We note that it is difficult to obtain bias-related label information, such as binary judgements on each sentence of a news article, as the bias may depend on the news event and its context.

To design the bias labeling task, we divided the news dataset into one reference news article (https://reut.rs/2ve3rMz) and 88 target news articles. Having a reference news article, users could first get familiar with the overall event. Furthermore, the motivation was to have some reference text which, being relatively bias-free, allows for detecting biased content in a target article. Our reference article was selected after being manually judged as relatively unbiased by several annotators.

We let the workers make judgements on each target news article (using also the reference news article). Each article was independently annotated by 5 workers. In order to ensure high-quality labeling, we produced various test questions to filter out low-quality answers. To create reliable answers to our test questions, we conducted a preliminary labeling task on a set of five randomly selected news articles from the same news collection, plus the same reference news article used for comparison. Nine graduate students (male: 6, female: 3) labeled bias-inducing words in these news articles. The words which were labeled as "bias-inducing" by at least two people were considered "biased" in general and served as ground truth for our test questions.

The instructions and main questions given to the workers in the crowdsourcing tasks and to annotators in the preliminary task can be summarized as follows:

1. Read the target news article and the reference news article.
2. Check the degree of bias of the target news article by comparing it with the reference news article.
   • not at all biased, slightly biased, fairly biased, strongly biased.
3. Select and submit words or phrases which cause the bias, compared to the reference news article.
   • Submit words or phrases together with the line identifier.
   • Submit as short content as possible and do not submit whole paragraphs.
   • If no bias-inducing words are found, submit "none".
4. Select your level of understanding of the news story.
   • four-scale ratings from "I didn't understand at all." to "I understood well."

In total, 60 workers participated in the task. We only used the answers from the 25 reliable workers who passed at least 50% of the test questions. Overall, for 88 documents, we collected 2,982 bias words (1,647 unique words) covered by 1,546 non-overlapping annotations.
3.3 Analysis of Perceived News Bias

We next analyze what kind of words are tagged as bias triggers by the workers. First, we analyze the phrases annotated as biased in terms of their length. Each annotation consists of four words on average (examples being "did absolutely nothing wrong", "putting them in handcuffs", "racism and racial profiling", "merely for their race", and "Starbucks manager was white"). Most answers submitted by workers are, however, single words, for example, "accuse", "absurd", "boycott", "discrimination", and "outrage". These examples also show a tendency towards negative sentiment and indicate that rather extreme, emotion-related words are annotated, which could be extracted almost without considering the context. As the second most frequent phrase pattern, three-word spans within a sentence have been annotated, such as "absolutely nothing wrong", "accusations of racism", "black men arrested", "who is black", and "other white ppl". These are typical combinations of sentiment words and modifiers or intensifiers. These sentiment words (with positive or negative polarity) are typically associated with the overall topic or event and can also be considered outstanding or salient to some degree.

We aggregated the answers of the crowd-workers on the sentence level, assuming that if a sentence includes any word annotated as biased, the sentence itself is biased. Note that information on sentence-level bias might be enough for the purpose of automatic bias detection. However, we let users annotate the specific bias-inducing phrases, since this lets us gain a fine-grained insight into the actual thoughts of users, allows us to choose appropriate machine learning features for bias-detection algorithms, and makes it possible to show concrete evidence of bias-inducing aspects in the texts to users. Table 1 shows the statistics of the dataset and the labeled results. Agreement level n denotes that only annotations tagged by at least n people are considered. When we only consider the unique, i.e., fused, answers from the workers, 826 of the 1,235 sentences in the whole data set (66.88%) included bias-annotated words. On average, 73.48% of the sentences in an article would then be considered potentially biased. Yet, assuming an agreement of 2 workers, the average share of biased sentences is 34.9%, while for n = 3 the corresponding number is 14.01%. These statistics reveal that different people perceive biased content through different words.

Table 1: Statistics of Labeled Sentences

Total number of news articles: 88
Total number of sentences: 1,235
Average share of tagged sentences per news article: 73.48%
No. of sentences including tagged words: 826 (66.88%)
No. of tagged sentences at agreement level 2: 431 (34.90%)
No. of tagged sentences at agreement level 3: 173 (14.01%)
No. of tagged sentences at agreement level 4: 42 (3.40%)
No. of tagged sentences at agreement level 5: 7 (0.57%)
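The sentence-level aggregation and the agreement levels used in Table 1 can be expressed compactly. The following is a minimal sketch with a hypothetical data structure (a mapping from worker ids to tagged (sentence, word) pairs), not the code used to produce the reported numbers.

```python
# Sketch of the sentence-level aggregation described above (data structures are hypothetical).
# `annotations` maps a worker id -> set of (sentence_id, word) pairs tagged as bias-inducing.
from collections import defaultdict

def biased_sentences(annotations, level):
    """Return ids of sentences annotated by at least `level` distinct workers."""
    workers_per_sentence = defaultdict(set)
    for worker, tags in annotations.items():
        for sentence_id, _word in tags:
            workers_per_sentence[sentence_id].add(worker)
    return {s for s, workers in workers_per_sentence.items() if len(workers) >= level}

# Toy example: three workers; agreement level 2 keeps only sentence 0.
toy = {
    "w1": {(0, "outrage"), (3, "accuse")},
    "w2": {(0, "absurd")},
    "w3": {(7, "boycott")},
}
print(biased_sentences(toy, level=2))  # {0}
```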
Inter-rater agreement. We next investigated the inter-rater agreement among the five workers' answers for each target news article. We calculated Krippendorff's alpha and pairwise Jaccard similarity coefficients. Krippendorff's alpha is used for quantifying the extent of agreement among multiple raters, while Jaccard similarity is mainly used for comparing the similarity between two sets. Here, we regard each sentence in a target news article as an item to be measured. The mean scores calculated over all the target articles are 0.513 for Krippendorff's alpha and 0.222 for Jaccard, as also shown in Figure 1.

Figure 1: Inter-rater reliability on the crowdsourcing result: (a) Krippendorff's alpha, (b) pairwise Jaccard.

The agreement scores are relatively low, which means the answers from the five workers are diverse, with only slight agreement. In practice, it is hard to obtain substantial agreement on news articles in general [NR10]. This may have several reasons in our case: firstly, the degree of perception concerning bias differs from person to person; secondly, the answer coverage varies between people and is imperfect. For example, some people might feel it is enough to submit around five different answers on a target news article, while others might try to find as many pieces of evidence of biased content as possible. It is then hard to decide whether the differences stem from the insincerity of individuals or from differences in their perception.
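To make the two agreement measures concrete, the sketch below computes mean pairwise Jaccard overlap and Krippendorff's alpha for a single article. It assumes binary per-sentence judgements (whether a worker tagged any word in the sentence) and the third-party krippendorff package; the exact settings behind the reported 0.513 and 0.222 are not specified in the paper, so this is only an approximation.

```python
# Sketch of the agreement measures discussed above; nominal (binary) sentence labels assumed.
from itertools import combinations
import numpy as np
import krippendorff  # pip install krippendorff

def mean_pairwise_jaccard(label_sets):
    """Average Jaccard overlap of the sentence sets tagged by each pair of workers."""
    scores = []
    for a, b in combinations(label_sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return float(np.mean(scores))

def alpha_for_article(label_sets, n_sentences):
    """Krippendorff's alpha over binary per-sentence judgements (raters x sentences)."""
    matrix = np.zeros((len(label_sets), n_sentences))
    for i, tagged in enumerate(label_sets):
        matrix[i, list(tagged)] = 1
    return krippendorff.alpha(reliability_data=matrix, level_of_measurement="nominal")

# Toy example: five workers annotating a 10-sentence article.
workers = [{0, 2}, {0, 3}, {0}, {2, 5}, {0, 2}]
print(mean_pairwise_jaccard(workers), alpha_for_article(workers, n_sentences=10))
```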
Analysis of POS tags. We investigated the part-of-speech tags included in the sentences. The Stanford POS Tagger [TKMS03] was employed in this process. To that end, we considered different agreement levels, i.e., the minimum number of users who tagged words as biased in the same sentence. We conducted a t-test between the bias-tagged sentences and the non-tagged sentences. Table 2 shows the POS tags that are statistically significant at p < 0.001.

Table 2: POS Feature Effects by t-test in Each Agreement Level (only significant results are shown, p < 0.001; t-values are reported across agreement levels 1, 3 and 5)

Cardinal number (CC): 5.19, 4.0554
Determiner (DT): 4.87, -4.4403
Existential there (EX): 3.81, -6.9333
Preposition/subordinating conjunction (IN): 7.63, 3.4378
Adjective (JJ): 9.2987, 3.4507
Adjective, superlative (JJS): -7.6947
Noun (NN): 7.5422
Noun, plural (NNS): 5.3969
Predeterminer (PDT): 3.7788, -8.7549
Adverb (RB): 5.3142
Adverb, superlative (RBR): -3.4822, -3.4797
Particle (RP): 5.6674, -11.969
Verb, past tense (VBD): 6.5408
Verb, gerund/present participle (VBG): 7.4645, 3.3702
Verb, past participle (VBN): 8.2355, 4.0162, -2.6979
Verb, 3rd person singular present (VBZ): 6.1593, 3.713
Wh-pronoun (WP): 5.4197, 2.4701
Wh-adverb (WRB): -15.243

Analysis of further linguistic features. We also investigated words using the linguistic categories proposed by [RDJ13], including sentiment, subject/object, verb types, named entities, and so on. In Table 3, we observe that the most significant word category is negative subject words at agreement level 1. Weak subject words and negative words are also shown to be significant. We believe this result arises because our news event is controversial and related to an arrest; therefore, many negative words affect users' perception of bias. Interestingly, factive verbs do not show any significant difference.

Table 3: Linguistic Feature Effects by t-test in Each Agreement Level (only significant results are shown, p < 0.001; t-values are reported across agreement levels 1, 3 and 5)

Factive verb: -10.154
Assertive verb: -3.2339, -4.3784
Implicative verb: -3.7975
Entailment: -2.7975
Weak subject word: 5.5862, 4.917
Negative word: 7.5961, 5.6002
Bias lexicon: -2.9986
Named entity: 3.375
Negative subject words: 9.7921, 8.2414

For a preliminary experiment, we next use the POS tags and the mentioned linguistic features to approach the task of automatically detecting bias. We employ a standard SVM model, using a randomly selected 80% of the sentences for training the model and the remaining 20% of the sentences for testing. The classification accuracy is 70%. As our data set is primarily designed for linguistic analysis, larger numbers of training/test examples are needed for obtaining more reliable evaluation results.
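The feature analysis and the SVM baseline described above can be sketched as follows. The paper used the Stanford POS Tagger and does not specify the t-test variant or the SVM kernel; the sketch substitutes NLTK's default tagger, Welch's t-test and a linear-kernel SVM, and the variables biased_sents, other_sents, all_sentences and bias_labels are hypothetical placeholders.

```python
# Sketch of the POS t-test and SVM baseline discussed above (assumed settings, not the originals).
from collections import Counter
import nltk
from scipy.stats import ttest_ind
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_counts(sentence):
    """Count Penn Treebank POS tags in one sentence."""
    return Counter(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence)))

def pos_ttest(biased, unbiased, tag):
    """Welch's t-test on the frequency of one POS tag in biased vs. unbiased sentences."""
    x = [pos_counts(s)[tag] for s in biased]
    y = [pos_counts(s)[tag] for s in unbiased]
    return ttest_ind(x, y, equal_var=False)

def train_svm(sentences, labels):
    """Train an SVM on POS-count features with an 80/20 split and report accuracy."""
    X = DictVectorizer().fit_transform([pos_counts(s) for s in sentences])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Hypothetical usage:
# print(pos_ttest(biased_sents, other_sents, "JJ"))
# print(train_svm(all_sentences, bias_labels))
```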
Further extensions. We analyzed bias in news sentences as perceived by people using crowdsourcing. In this research, we used a news event that occurred within a short time period, so users do not need to spend much time understanding the context of the news event. However, in the case of a long-lasting news event, the news topic tends to be complicated or to consist of many sub-events, and there might be many aspects to be aware of. For example, politics-related news events typically have a long time span; when they cover elections, reports on the actions of candidates appear in the weeks beforehand. For detecting and/or minimizing news bias in such more complex situations, an alternative strategy for obtaining a reasonable ground truth concerning news bias might be to focus on credibility aspects and to target the recommendation of citations to clearly and formally stated facts and/or events, such as ones in existing knowledge bases.

4 Conclusions and Future Works

Detecting news bias is a challenging task for computer science as well as for linguistics and media research, due to its subtle nature and the heterogeneous, diverse kinds of biases. In this paper, we set up a crowdsourcing task to annotate news articles with bias-inducing words. We then analyzed features of the annotated words based on different user agreement levels. Based on the results, we draw the following conclusions:

1. Generally, it is hard to reach an agreement among users concerning biased words or sentences.
2. According to the results, it is reasonable to focus on linguistic features, such as negative words, negative subjective words, etc., for detecting bias on a word level. This also means that for shallow bias detection, capturing the context, such as having semantically structured representations of statements or sentences, might not be needed.
3. Our experiments on the characteristics of bias-inducing words indicate that presenting readers with bias-inducing words (e.g., by highlighting them in the text) is still worthwhile to pursue in the future.
4. A deeper analysis of bias in the news is needed. Current efforts, such as the SemEval 2019 Task 4 ("Hyperpartisan News Detection", https://pan.webis.de/semeval19/semeval19-web/), can be seen as first steps in this direction. More generally, we argue that we need novel ways to measure the actual bias of news (and other texts). This could be achieved by measuring the effect of article reading, not only by asking readers before and after reading about their opinion on the topic/event, but also by correlating the read news with actions, such as the votes of readers in upcoming elections.

Acknowledgments

This research was supported in part by MEXT grants (#17H01828; #18K19841; #18H03243).

References

[ABHK08] Karel Jan Alsem, Steven Brakman, Lex Hoogduin, and Gerard Kuper. The impact of newspapers on consumer confidence: does spin bias exist? Applied Economics, 40(5):531–539, 2008.

[ACG+12] Jisun An, Meeyoung Cha, Krishna P. Gummadi, Jon Crowcroft, and Daniele Quercia. Visualizing media bias through Twitter. In Proc. of ICWSM SocMedNews Workshop, 2012.

[ADS17] Héctor Martínez Alonso, Amaury Delamaire, and Benoît Sagot. Annotating omission in statement pairs. In Proc. of LAW@EACL 2017, pages 41–45, 2017.

[Ben16] W. Lance Bennett. News: The Politics of Illusion. University of Chicago Press, 2016.

[BEQ+15] Eric Baumer, Elisha Elovic, Ying Qin, Francesca Polletta, and Geri Gay. Testing and comparing computational approaches for identifying the language of framing in political news. In Proc. of NAACL HLT 2015, pages 1472–1482, 2015.

[BES10] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proc. of LREC 2010, 2010.

[DA00] Dave D'Alessio and Mike Allen. Media bias in presidential elections: A meta-analysis. Journal of Communication, 50(4):133–156, 2000.

[FMK+10] Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. In Proc. of CSLDAMT '10, pages 80–88, 2010.

[GM05] Tim Groseclose and Jeffrey Milyo. A measure of media bias. The Quarterly Journal of Economics, 120(4):1191–1237, 2005.

[GS06] Matthew Gentzkow and Jesse M. Shapiro. Media bias and reputation. Journal of Political Economy, 114(2):280–316, 2006.

[HMG17] Felix Hamborg, Norman Meuschke, and Bela Gipp. Matrix-based news aggregation: Exploring different news perspectives. In Proc. of JCDL 2017, pages 69–78, 2017.

[NR10] Stefanie Nowak and Stefan M. Rüger. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In Proc. of MIR 2010, pages 557–566, 2010.

[OMY11] Tatsuya Ogawa, Qiang Ma, and Masatoshi Yoshikawa. News bias analysis based on stakeholder mining. IEICE Transactions, 94-D(3):578–586, 2011.

[PKCS09] Souneil Park, Seungwoo Kang, Sangyoung Chung, and Junehwa Song. NewsCube: delivering multiple aspects of news to mitigate media bias. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pages 443–452, 2009.

[RDJ13] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting biased language. In Proc. of ACL 2013, volume 1, pages 1650–1659, 2013.

[TKMS03] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL 2003, pages 173–180, 2003.

[ZLP+15] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina Bontcheva, and Peter Tolmie. Crowdsourcing the annotation of rumourous conversations in social media. In Proc. of WWW 2015, pages 347–353, 2015.