     Understanding Characteristics of Biased Sentences in
                      News Articles

              Sora Lim                                  Adam Jatowt                      Masatoshi Yoshikawa
          Kyoto University                             Kyoto University                    Kyoto University
            Kyoto, Japan                                Kyoto, Japan                         Kyoto, Japan
    lim.sora.88u@st.kyoto-u.ac.jp                  adam@dl.kuis.kyoto-u.ac.jp          yoshikawa@i.kyoto-u.ac.jp

                          Abstract

      Providing balanced and good quality news articles to readers is an important
      challenge in news recommendation. Often, readers tend to select and read articles
      which confirm their social environment and their political beliefs. This issue is
      also known as the filter bubble. As a remedy, initial approaches towards
      automatically detecting bias in news articles have been developed. Obtaining a
      suitable ground truth for such a task is, however, difficult. In this paper, we
      describe a ground truth dataset created with the help of crowd-sourcing for
      fostering research on bias detection and removal from news content. We then analyze
      the characteristics of the user annotations, in particular concerning bias-inducing
      words. Our results indicate that determining bias-inducing words is subjective to a
      certain degree and that a high agreement on all bias-inducing words of all readers
      is hard to obtain. We also study the discriminative characteristics of biased
      content and find that linguistic features, such as negative words, tend to be
      indicative of bias.

1     Introduction

In news reporting it is important for both authors and readers to maintain high fairness
and accuracy, and to keep a balance between different viewpoints. However, bias in news
articles has become a major issue [GM05, Ben16] even though many news outlets claim to
have a dedicated policy to ensure objectivity in their articles. Different news sources
may have their own views towards society, politics and other topics. Furthermore, they
need to attract readers to make their businesses profitable. This frequently leads to a
potentially harmful reporting style resulting in biased news.

   As a remedy for news bias, users often try to choose news articles from news sources
(outlets) which are known to be relatively unbiased. Ideally, this should be performed by
corresponding recommender systems. However, bias-free article recommendations are still
not feasible given the state of the art. Furthermore, the recommendations might not be
trusted by users, as readers often need concrete evidence of bias in the form of
bias-inducing words and similar aspects.

   In this paper, we focus on understanding news bias and on developing a high-quality
gold standard for fostering bias-detection studies on the sentence and word levels. We
assume here that word choices made by articles’ authors might reflect some bias in terms
of their viewpoint. For example, the phrases “illegal immigrants” and “undocumented
immigrants” chosen by news reporters to refer to immigrants in relation to Donald Trump’s
decision to rescind Deferred Action for Childhood Arrivals may be considered a case where
the choice of words can result in a bias. Here, the use of the word “illegal” degrades
the immigrants by inducing a more negative value than the adjective “undocumented” does.
By such nuanced word choices, news authors may imply their stance on the news event and
deliver a biased view to the readers.

   It is, however, challenging to identify words that cause the article to have biased
points of view [BEQ+15]. The bias inherent in news articles tends to be subtle and
intricate. In this research, we construct a comparable news dataset which consists of
news articles reporting the same news event. The objective is
shed new light on the way in which users recognize bias in news articles. To the best of
our knowledge, this is the first dataset with annotated bias words in news articles. In
the following, we describe the design of the crowd-sourcing task to obtain the bias
labels for the news words and we subsequently analyze the characteristics of detected
biased content in news.

                 Table 1: Statistics of Labeled Sentences

  Total number of news articles                  88
  Total number of sentences                      1,235
  Average tagged sentences per news article      73.48%
  No. of sentences including tagged words        826 (66.88%)
  No. of tagged sentences on agreement level 2   431 (34.90%)
  No. of tagged sentences on agreement level 3   173 (14.01%)
  No. of tagged sentences on agreement level 4   42 (3.40%)
  No. of tagged sentences on agreement level 5   7 (0.57%)

2    Related Works

Several prior works have focused on media bias in general and news bias in particular.
Generally, according to D’Alessio and Allen [DA00], media bias can be divided into three
different types: (1) gatekeeping, (2) coverage and (3) statement bias. Gatekeeping bias
is a selection of stories out of the potential stories; coverage bias expresses how much
space specific positions receive in media; statement bias, in contrast, denotes how an
author’s own opinion is woven into a text. Similarly, Alsem et al. [ABHK08] divide news
bias into ideology and spin. Ideology reflects news outlets’ desire to affect readers’
opinions in a particular direction. Spin reflects the outlet’s attempt to simply create a
memorable story. Given these distinctions, we consider the bias type tackled in this
paper as statement bias w.r.t. [DA00] and as spin bias according to [ABHK08].

    Several studies have sought to provide effective means for solving the news bias
problem. However, most of them have focused on news diversification according to content
similarity and the political stance of news outlets. Park et al. [PKCS09], for instance,
have developed a news diversification system, named NewsCube, to mitigate the bias
problem by providing diverse information to the users. Hamborg et al. [HMG17] presented a
matrix-based news analysis to display various perspectives for the same news topic in a
two-dimensional matrix. An et al. [ACG+12] revealed the skewness of news outlets by
analyzing how their news contents spread throughout tweets.

    Alonso et al. [ADS17] focused on omissions between news statements which are similar
but not identical. Omission constitutes one category of news bias in that it is a means
of statement bias [GS06]. Ogawa et al. [OMY11] attempted to describe the relationship
between the main participants in news articles to detect news bias. To capture how the
relationship is described, they expanded sentiment words in SentiWordNet [BES10].

    Other works focused on linguistic analysis for bias detection on text data. Recasens
et al. [RDJ13] targeted detecting bias words from the sentence revision history in
Wikipedia. They utilized NPOV tags as bias labels and linguistically categorized
resources as bias features. Baumer et al. [BEQ+15] used Recasens et al.’s linguistic
features to identify biased language in political news as well as features from
theoretical literature on framing.

3     Annotating Bias in News Articles

3.1    Dataset

To detect the subtle differences which cause bias, one way is to compare words across the
content of different news articles which are reporting the same news event. This should
allow for pinpointing differences in the subtle use of words by different authors from
diverse media outlets to describe the same event. Although many news datasets were
created for news analysis, to the best of our knowledge, none focused on a single event
while, at the same time, covering many news articles from various news outlets from a
short time range.

   We selected the news event titled “Black men arrested in Starbucks” which has caused
controversial discussions on racism. The event happened on April 12, 2018. We focused on
news articles written on April 15, 2018 as the event was widely reported by different
news outlets on that day.

   For collecting news articles from various news outlets we used Google News². Google
News is a convenient source for our case as it already clusters news articles concerning
the same event coming from various sources. We first crawled all news articles available
online that described the aforementioned event. Based on manual inspection, we then
verified whether all articles are about the same news event. We next extracted the titles
and text content from the crawled pages, ignoring pages which contained only pictures or
only a single sentence. In the end, our dataset consists of 89 news articles with 1,235
sentences and 2,542 unique words from 83 news outlets. Articles contain on average 14
paragraphs.

  ² https://news.google.com/?hl=en-US&gl=US&ceid=US:en

3.2    Bias Labeling via Crowd-Sourcing

To overcome the scalability issue in annotation, crowd-sourcing has been widely used
[FMK+10, ZLP+15]. We also use crowdsourcing to collect bias labels and
we choose Figure Eight³ as our platform. Figure Eight (called CrowdFlower until March
2018) has been used in a variety of annotation tasks and is especially suitable for our
purposes due to the focus on producing high-quality annotations. We note that it is
difficult to obtain bias-related label information such as binary judgements on each
sentence of news articles, as the bias may depend on the news event and its context. To
design the bias labeling task, we divided the news dataset into one reference news
article⁴ and 88 target news articles. Having a reference news article, users could first
get familiar with the overall event. Furthermore, the motivation was to have some
reference text which, being relatively bias-free, allows for detecting biased content in
a target article. Our reference article has been selected after being manually judged as
relatively unbiased by several annotators.

  ³ https://www.figure-eight.com/.
  ⁴ https://reut.rs/2ve3rMz

   We let the workers make judgements on each target news article (using also the
reference news article). Each article has been independently annotated by 5 workers. In
order to ensure high-quality labeling, we produced various test questions to filter out
low quality answers. To create reliable answers to our test questions, we conducted a
preliminary labeling task on a set of five randomly selected news articles from the same
news collection, plus the same reference news article used for comparison. Nine graduate
students (male: 6, female: 3) labeled bias-inducing words in these news articles. The
words which have been labeled as “bias-inducing” by at least two people were considered
as “biased” in general and served as ground truth for our test questions.

   The instructions and main questions given to the workers in the crowdsourcing tasks
and to annotators in the preliminary task can be summarized as follows:

 1. Read the target news article and the reference news article.
 2. Check the degree of bias of the target news article by comparing with the reference
    news article.
       • not at all biased, slightly biased, fairly biased, strongly biased.
 3. Select and submit words or phrases which cause the bias, compared to the reference
    news article.
       • Submit words or phrases with the line identifier.
       • Try to submit as short as possible content and don’t submit whole paragraphs.
       • If no bias inducing words are found, submit “none”.
 4. Select your level of understanding of the news story
       • four scale ratings from “I didn’t understand at all.” to “I understood well.”

   In total, 60 workers participated in the task. We only used the answers from 25
reliable workers who passed at least 50% of test questions. Overall, for 88 documents, we
collected 2,982 bias words (1,647 unique words) covered by 1,546 non-overlapping
annotations.

3.3   Analysis of Perceived News Bias

We next analyze what kind of words are tagged as bias triggers by the workers. First, we
analyze the phrases annotated as biased in terms of the word length. Each annotation
consists of four words on average (examples being “did absolutely nothing wrong”,
“putting them in handcuffs”, “racism and racial profiling”, “merely for their race”, and
“Starbucks manager was white”). Most answers submitted by workers are, however, single
words, for example, “accuse”, “absurd”, “boycott”, “discrimination”, and “outrage”. These
examples also show a tendency towards negative sentiment and indicate that rather
extreme, emotion-related words are annotated, which could be extracted almost without
considering the context. As the second most frequent phrase pattern, three words in a
sentence have been annotated, such as “absolutely nothing wrong”, “accusations of
racism”, “black men arrested”, “who is black”, and “other white ppl”. These are typical
combinations of sentiment words and modifiers or intensifiers. These sentiment words
(with positive or negative polarity) are typically associated with the overall topic or
event and can also be considered as outstanding or salient to some degree.

   We aggregated the answers of the crowd-workers on the sentence level, assuming that if
a sentence includes any word annotated as biased, the sentence itself is biased. Note
that the information on sentence level bias might be enough for the purpose of automatic
bias detection. However, we let users annotate the specific bias-inducing phrases, since
this lets us gain a fine-grained insight into the actual thoughts of users and allows us
to choose appropriate machine learning features for bias-detection algorithms, as well as
to show concrete evidence of bias-inducing aspects in the texts to users. Table 1 shows
the statistics of the dataset and labeled results. Agreement level n denotes that only
annotations tagged by at least n people are considered. When we only consider the unique,
i.e., merged answers from the workers, among 1,235 sentences in the whole data set, 826
sentences (66.88%) included bias-annotated words. On average, 73.48% of the sentences in
an article would then be considered potentially biased. Yet, assuming an agreement of 2
workers, the share of biased sentences is 34.9%, while for n = 3 the corresponding share
is 14.01%. These statistics reveal that different people consider different words as
representing biased content.

   Inter-rater agreement. We next investigated the inter-rater agreement among the five
workers’ answers
for each target news article. We calculated Krippendorff’s alpha and pairwise Jaccard
similarity coefficients. Krippendorff’s alpha is used for quantifying the extent of
agreement among multiple raters, and Jaccard similarity is mainly used for comparing the
similarity between two sets. Here, we regard each sentence in a target news article as an
item to be measured. The mean scores calculated over all the target articles are 0.513
for Krippendorff’s alpha and 0.222 for Jaccard, as shown also in Figure 1. The agreement
scores are relatively low, which means that the answers from the five workers are
diverse, with only slight agreement. In practice, it is hard to get substantial agreement
on news articles in general [NR10]. This may have several reasons in our case. Firstly,
the degree of perception concerning bias differs from person to person. Secondly, the
answer coverage by people is different and imperfect. For example, some people might feel
it is enough to submit around five different answers on a target news article, while
others might try to find as much evidence of biased content as possible. It is then hard
to decide whether the differences stem from the insincerity of individuals or from
differences in their perception.

Figure 1: Inter-rater reliability on the Crowdsourcing result: (a) Krippendorff’s alpha
(b) Pairwise Jaccard.

    Analysis of POS tags. We investigated the part of speech tags included in the
sentences. The Stanford POS Tagger [TKMS03] was employed in this process. To that end, we
considered different agreement levels, i.e., the minimum number of users who tag words as
biased in the same sentences. We conducted the t-test for the bias tagged sentences and
non-tagged sentences. Table 2 shows the POS tags that are statistically significant at
p < 0.001.

      Table 2: POS Feature Effects by t-test in Each Agreement Level⁵

  Agreement Level                                  1         3          5

  Cardinal number (CC)                             5.19      4.0554
  Determiner (DT)                                  4.87                 -4.4403
  Existential there (EX)                           3.81                 -6.9333
  Preposition/subordinating conjunction (IN)       7.63      3.4378
  Adjective (JJ)                                   9.2987    3.4507
  Adjective, superlative (JJS)                                          -7.6947
  Noun (NN)                                        7.5422
  Noun, plural (NNS)                               5.3969
  Predeterminer (PDT)                              3.7788               -8.7549
  Adverb                                           5.3142
  Adverb, superlative (RBR)                                  -3.4822    -3.4797
  Particle                                         5.6674               -11.969
  Verb, past tense (VBD)                           6.5408
  Verb, gerund/present participle (VBG)            7.4645    3.3702
  Verb, past participle (VBN)                      8.2355    4.0162     -2.6979
  Verb, 3rd ps. sing. present (VBZ)                6.1593    3.713
  Wh-pronoun (WP)                                  5.4197    2.4701
  Wh-adverb (WRB)                                                       -15.243

  ⁵ Only significant results are shown (p < 0.001).

    Analysis of further linguistic features. We also investigate words by using the
linguistic categories proposed by [RDJ13], including sentiment, subject/object, verb
types, named entity and so on. In Table 3, we observe that the most significant word
category is negative subject words in agreement level 1. Also weak subject words and
negative words are shown to be significant. We believe this result arises because our
news event is controversial and related to an arrest; therefore, many negative words
affect users’ perception of bias. Interestingly, factive verbs do not show any
significant difference.

   For the preliminary experiments, we next use the POS tags and the mentioned linguistic
features for approaching the task of automatically detecting bias. We employ a standard
SVM model and use a randomly selected 80% of the sentences for training the model and the
remaining 20% of sentences for testing. The classification accuracy is 70%. As our data
set is primarily designed for linguistic analysis, larger numbers of train/test examples
are needed for obtaining more reliable evaluation results.

   Further extensions. We analyzed bias in the news sentences perceived by people using
crowdsourcing. In this research, we used a news event that occurred in a short time
period. Thus, users do not need to spend much time to understand the context of the news
event. However, in the case of a long-lasting news event, the news topic tends to be
complicated or consists of many sub-events, and there might be many aspects to be aware
of. For example, politics-related news events typically have a long time span: when they
cover elections, reports on the actions of candidates appear in the weeks beforehand. For
detecting and/or minimizing the news bias under more complex situations, an alternative
strategy for obtaining a rea-
achieved by measuring the effect of article reading by not only asking readers before and
after the reading about their opinion on topic/event, but also by correlating the read
news with actions, such as the votes of readers in upcoming elections.

   Acknowledgments This research was supported in part by MEXT grants (#17H01828;
#18K19841; #18H03243).

   Table 3: Linguistic Feature Effects by t-test in Each Agreement Level⁵

  Agreement Level             1         3          5

  Factive verb                                     -10.154
  Assertive verb                        -3.2339    -4.3784
  Implicative verb                      -3.7975
  Entailment                            -2.7975
  Weak subject word           5.5862    4.917
  Negative word               7.5961    5.6002
  Bias Lexicon                          -2.9986
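
The t statistics in Tables 2 and 3 compare, for each feature, the sentences tagged as
biased at a given agreement level with the remaining sentences. The following is a
minimal sketch of such a comparison under stated assumptions. The paper does not say
which t-test variant was used, so Welch's unequal-variance statistic is assumed here. The
per-sentence counts are invented toy values, not data from the study, and the function
names `split_by_agreement` and `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def split_by_agreement(feature_counts, tagger_counts, level):
    """Split per-sentence feature counts into tagged / non-tagged groups.

    A sentence counts as tagged at agreement level n when at least n
    workers marked some word in it as bias-inducing.
    """
    pairs = list(zip(feature_counts, tagger_counts))
    tagged = [f for f, k in pairs if k >= level]
    untagged = [f for f, k in pairs if k < level]
    return tagged, untagged


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))


# Toy data (hypothetical): number of negative words in each sentence, and
# the number of workers who tagged at least one word in that sentence.
negative_words = [3, 2, 4, 0, 1, 0, 1, 0]
taggers = [3, 2, 4, 0, 1, 0, 0, 1]

tagged, untagged = split_by_agreement(negative_words, taggers, level=2)
t_stat = welch_t(tagged, untagged)
print(f"t = {t_stat:.3f}")
```

In this sketch a positive statistic means the feature occurs more often in tagged
sentences than in non-tagged ones. Raising `level` shrinks the tagged group, which is one
reason to treat statistics at high agreement levels with caution when few sentences
remain.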
