Biased News Data Influence on Classifying Social Media Posts

Marija Stanojevic, Jumanah Alshehri, Eduard Dragut, Zoran Obradovic
Center for Data Analytics and Biomedical Informatics (DABI)
Temple University, Philadelphia, Pennsylvania, USA
{marija.stanojevic, jumanah.alshehri, edragut, zoran.obradovic}@temple.edu

Abstract

A common task among social scientists is to mine and interpret public opinion using social media data. Scientists tend to employ off-the-shelf state-of-the-art short-text classification models. Those algorithms, however, require a large amount of labeled data. Recent efforts aim to decrease the required number of labeled examples via self-supervised learning and fine-tuning. In this work, we explore the use of news data on a specific topic to fine-tune opinion mining models learned from social media data, such as Twitter. In particular, we investigate the influence of biased news data on models trained on Twitter data, considering both balanced and unbalanced cases. Results demonstrate that tuning with biased news data of different properties changes the classification accuracy by up to 9.5%. The experimental studies reveal that the characteristics of the tuning text, such as bias, vocabulary diversity, and writing style, are essential for the final classification results, while the size of the data is less consequential. Moreover, a state-of-the-art algorithm is not robust on an unbalanced Twitter dataset and over-predicts the most frequent label.

1 Introduction

In recent years, social media platforms have become leading channels for the exchange of knowledge, debates, and product or opinion advertising [PP10, WD07, Gly18, SKB12]. Social scientists routinely use data from social media platforms to survey public opinion on specific topics [Mos13, CSPR16, HBK+17, BM18], and computer scientists use the data to improve the performance of state-of-the-art natural language processing (NLP) algorithms [CXHW17, ACCF16, GPCR18, ZWWL18].

Social media data, while abundant, pose many challenges in usage: 1) user demographics are rarely available; 2) posts are short and sometimes hard to understand without context; and 3) it is challenging to label millions of posts manually in a short time. One may overcome the first challenge by selecting only information from users whose demographic information is available across multiple social platforms. However, this may bias the data. To solve the other two problems, we need systems that classify data into different opinion classes with limited human involvement.

Numerous algorithms have been proposed to cope with large amounts of short text [ZZL15, ZQZ+16, LQH16, XC16, CSBL16, YYD+16, MGB+18]. All these algorithms are supervised in nature and therefore require hundreds of thousands of labels in order to achieve adequate performance levels. In the last two years, algorithms such as CoVe [MBXS17], ELMo [PNI+18], ULMFiT [HR18], and OpenAI GPT [RNSS18] have been proposed to minimize the need for labeled data and to increase performance by cleverly exploiting text characteristics. Those methods start from word vectors pre-trained on general documents and fine-tune them on domain-specific documents via self-supervised learning. In Universal Language Model Fine-tuning for Text Classification (ULMFiT) [HR18], the self-supervised process predicts the next word based on the previous words in the context. After the fine-tuning step, we additionally train the model with a small number of manually labeled text instances (Figure 1).

Figure 1: ULMFiT model training-flow overview
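To make the self-supervised objective concrete, the toy sketch below builds (context, next-token) pairs from a single post, which is all the "labeling" that language-model fine-tuning requires. This is our simplified illustration, not the authors' code: ULMFiT uses a proper tokenizer and whole corpora, while here a whitespace tokenizer and an invented example sentence stand in.

```python
# Minimal sketch of the self-supervised signal behind ULMFiT-style language
# modeling: every token is predicted from the tokens preceding it, so no manual
# labels are needed. Whitespace tokenization and the example post are
# illustrative simplifications.

def next_word_examples(text):
    """Turn raw text into (context, next-token) training pairs."""
    tokens = text.lower().split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

if __name__ == "__main__":
    post = "remember to vote in the 2018 midterm elections"
    for context, target in next_word_examples(post):
        print(f"{' '.join(context):45s} -> {target}")
```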
Most of the datasets (e.g., AG News, DBPedia, Yahoo Answers) [ZZL15] used for testing text classification algorithms are balanced and, on average, contain much longer texts than social media posts. On the other hand, social media data retrieved in order to model opinion is usually unbalanced. The goal of this paper is to investigate the performance of the ULMFiT model in classifying social media posts under different settings of fine-tuning and labeled datasets. We test balanced and unbalanced labeled social media datasets and fine-tuning news texts with different characteristics (e.g., size, bias, writing style).

Experiments utilize Twitter data related to the 2018 USA midterm elections and news data from the 2016 USA elections. The news data is collected from six major outlets which are considered to have a bias towards the left or right political spectrum.[1] We test how fine-tuning with articles from different news outlets influences the accuracy of social media post classification. The hypothesis is that fine-tuning with appropriate topic-related news text can help improve classification, but that bias in the news articles can also hurt performance. We test the hypothesis on the ULMFiT algorithm, described in the next section.[2]

[1] Information about outlet bias is taken from: https://mediabiasfactcheck.com
[2] The latest text classification progress: http://nlpprogress.com/english/text_classification.html

2 Methods

The ULMFiT model [HR18] consists of three training components (Figure 2). Each component is based on the AWD-LSTM language model [MKS17] and consists of a word-embedding (input) layer, multiple LSTM layers, and a softmax layer used to predict the output. Experimental results in the literature show that multiple LSTM layers can learn more complex contexts [MBXS17, PNI+18, HR18] than single-LSTM-layer models.

Figure 2: ULMFiT model training details

In the first part of ULMFiT, word and context embeddings are learned from general texts (such as Wikipedia). In the second part, they are updated (fine-tuned) with topic-related data to learn domain-specific words and phrases. The third part is trained on labeled domain-specific examples so that it can predict labels for new examples. The output of each part is the input to the next step.

Even though the ULMFiT model is complex, it can be trained efficiently on GPUs when smartly implemented. Once trained, the first part of the model does not change, so we use pre-trained WT103 vectors to reduce training time.[3]

[3] WT103 word-vectors can be found here: http://files.fast.ai/models/wt103

In order to speed up fine-tuning (Figure 2b), we use different learning rates for each LSTM layer. The top layer, which calculates the softmax, has the largest learning rate, η_L. Learning rates for the remaining layers are set to η_{l−1} = η_l / 2.6 for l ∈ (1, L], as suggested in a prior study [HR18].
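The snippet below is a minimal PyTorch sketch of such discriminative learning rates (our illustration, not the authors' implementation): each layer of a toy embedding/LSTM/softmax stack is placed in its own optimizer parameter group, and the rate is divided by 2.6 at every step down from the top layer. The layer sizes and the top learning rate are placeholder values.

```python
# Sketch of discriminative (per-layer) learning rates: the top layer gets the
# largest rate and every lower layer gets the rate of the layer above divided
# by 2.6, following eta_{l-1} = eta_l / 2.6. Layer sizes and eta_top are
# illustrative, not the values used in the paper.
import torch
import torch.nn as nn

# Toy stand-in for the AWD-LSTM stack: embedding, three LSTM layers, softmax head.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=400)
lstm1 = nn.LSTM(input_size=400, hidden_size=1150, batch_first=True)
lstm2 = nn.LSTM(input_size=1150, hidden_size=1150, batch_first=True)
lstm3 = nn.LSTM(input_size=1150, hidden_size=400, batch_first=True)
head = nn.Linear(400, 10_000)

layers = [embedding, lstm1, lstm2, lstm3, head]  # ordered bottom -> top

eta_top = 0.01  # learning rate of the top (softmax) layer
param_groups = []
for depth, layer in enumerate(reversed(layers)):      # iterate top layer first
    param_groups.append({
        "params": layer.parameters(),
        "lr": eta_top / (2.6 ** depth),               # eta_{l-1} = eta_l / 2.6
    })

optimizer = torch.optim.Adam(param_groups)
for group in optimizer.param_groups:
    print(f"lr = {group['lr']:.6f}")
```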
Instead of a constant learning rate, slanted triangular learning rates are used for every layer to improve the accuracy of the model [HR18]. First, the learning rate increases sharply and linearly so that the model can learn quickly from the first examples. Once the learning rate reaches η_L, it declines slowly and linearly, as shown in the top-right corner of Figure 2.

In the third step (Figure 2c), layers are trained gradually. First, only the top layer is trained with labeled data for one epoch while the other layers are frozen. In each new epoch, the next frozen layer from the top is added to the training.
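The sketch below (again our illustration, with hyperparameter values taken from common ULMFiT practice rather than from this paper) shows the two mechanics just described: the slanted triangular schedule, which rises linearly to the peak rate over a short fraction of training and then decays linearly, and gradual unfreezing, which makes one additional layer trainable per epoch, starting from the top.

```python
# Sketch of the slanted triangular learning-rate schedule and gradual
# unfreezing described in [HR18]; cut_frac and ratio are commonly used
# defaults, not values reported in this paper.

def slanted_triangular_lr(t, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at step t: sharp linear warm-up, then slow linear decay."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                      # rising phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # slowly declining phase
    return lr_max * (1 + p * (ratio - 1)) / ratio


def gradually_unfreeze(layers, epoch):
    """layers are torch.nn modules ordered bottom -> top; in epoch 0 only the
    top layer is trainable, and each later epoch unfreezes one more layer."""
    for depth, layer in enumerate(reversed(layers)):
        trainable = depth <= epoch
        for param in layer.parameters():
            param.requires_grad_(trainable)


if __name__ == "__main__":
    for t in (0, 5, 10, 50, 99):
        print(f"step {t:3d}: lr = {slanted_triangular_lr(t, 100, lr_max=0.01):.5f}")
```

In practice the schedule is applied on top of the per-layer rates sketched earlier, so each layer follows the same triangular shape scaled by its own maximum rate.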
3 Experiments

Experiments are conducted using Twitter data on the 2018 USA midterm elections and news data from the 2016 USA elections.

Twitter data is collected by searching for posts published between November 4th and 7th, 2018 that carry one of the hashtags "#vote", "#trump", "#election", "#midtermelection", "#democrats", "#republicans", or "#2018midterms". In total, we accrue 936,462 tweets. Most of the posts are retweets, which appear multiple times in the corpus. After retweet removal, 244,320 distinct posts remain, and we pre-process their text by removing all non-alphanumeric characters.

Out of those posts, we label 1,526 examples with 0, 1, or 2. Label 0 is assigned to examples that support or promote the left political spectrum or denounce the right point of view. Label 1 is given to politically neutral posts (e.g., posts that encourage voting). Label 2 is assigned to examples that support or advertise the right political spectrum or condemn the left point of view. We discard 500 examples (∼25% of the annotated posts) because they are unrelated to the elections.

News data is collected from six outlets that are perceived to have different political partisanship, ranging from left-oriented to right-oriented, based on the Media Bias/Fact Check website (Table 1). Articles published between October 2015 and May 2017 that contain the words "election", "ballot", "republican", "GOP", or "democrat" are selected. The news articles differ substantially in writing style, content diversity, bias, number of articles, and number of words (Table 1). As with the tweets, the news articles do not always discuss the U.S. elections; sometimes they debate Brexit or elections in France and other countries worldwide. In pre-processing, we remove all non-alphanumeric characters from the news articles.

Table 1: Outlets

Outlet                      Bias            #Words
CNN News (CNN)              left            426,778
Washington Post (WP)        left-center     9,229,176
BBC News (BBC)              neutral-left    1,247,437
MarketWatch (MW)            neutral-right   1,505,107
Wall Street Journal (WSJ)   right-center    547,548
FoxNews (FN)                right           3,082,912

Experiment settings. We use the pre-trained WT103 token vectors in the first ULMFiT step. WT103 contains 103 million Wikipedia tokens for training, 217K tokens for validation, and 245K tokens for testing [MXBS16]. Our system is trained using the architecture in Figure 2a. The vocabulary has 267K unique tokens. In this paper, "word" and "token" are used interchangeably.

For the fine-tuning step, we explore ten different settings: 1) "all news" text with data from all outlets + tweet text; 2) only the tweets; 3) text from "left-biased" outlets + tweet text; 4) text from "right-biased" outlets + tweet text. The remaining six experiments each combine text from a single outlet with the tweet text. We randomly permute the examples in a fine-tuning dataset before usage.

In the third step, experiments test two settings of labeled Twitter data. Mix 1 (balanced mix) contains 380 examples with label 0 (left), 323 examples with label 1 (neutral), and 323 examples with label 2 (right). Mix 2 (unbalanced mix) contains 380 examples with label 0 (left), 823 examples with label 1 (neutral), and 323 examples with label 2 (right). We randomly split the labeled data into three disjoint parts: test (200 examples), validation (200 examples), and training (626 examples in Mix 1 and 1,126 examples in Mix 2). Each experiment is repeated four times, and the mean and standard deviation of the accuracy are reported for each of the ten settings.
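The sketch below shows how the two labeled mixes and the 200/200/remainder split can be reproduced; the class counts come from the paper, but the placeholder texts and the splitting code are our own, since the exact implementation is not published.

```python
# Sketch (our illustration, not the authors' released code) of building the two
# labeled mixes and the random test/validation/train split described above.
import random

def make_split(examples, seed=0):
    """examples: list of (text, label) pairs; returns test, validation, train."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:200], shuffled[200:400], shuffled[400:]

# Class counts from the paper: label 0 = left, 1 = neutral, 2 = right.
MIX_1_COUNTS = {0: 380, 1: 323, 2: 323}   # balanced mix
MIX_2_COUNTS = {0: 380, 1: 823, 2: 323}   # unbalanced mix

def fake_mix(counts):
    """Stand-in dataset with placeholder texts, used only to show split sizes."""
    return [(f"tweet_{label}_{i}", label) for label, n in counts.items() for i in range(n)]

if __name__ == "__main__":
    for name, counts in (("Mix 1", MIX_1_COUNTS), ("Mix 2", MIX_2_COUNTS)):
        test, valid, train = make_split(fake_mix(counts))
        print(name, "test:", len(test), "valid:", len(valid), "train:", len(train))
```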
We do not clean the Twitter and news data of non-relevant examples, in order to emulate the real-world situation. The data retrieval process is intentionally simple to mirror the information extraction process often used in research papers [Mos13, CSPR16, HBK+17, BM18]. These experiments therefore test the robustness of the model to bias and noise in the data and to unbalanced classes.

4 Results and discussion

We repeat each experiment four times and report the mean and standard deviation of the accuracy in Table 2. The high standard deviation (1.2-5.3%) indicates the model's sensitivity to the order of examples in the fine-tuning data and a need for more labeled examples.

Table 2: Classification results

News sources included         Mix 1                 Mix 2
(Left : Neutral : Right)      (380 : 323 : 323)     (380 : 823 : 323)
All news                      53.2 ± 3.0%           59.4 ± 3.7%
No news                       56.0 ± 5.3%           66.6 ± 2.5%
Left-biased (CNN+WP+BBC)      49.2 ± 2.9%           61.1 ± 3.3%
Right-biased (MW+WSJ+FN)      51.7 ± 3.8%           63.0 ± 3.2%
CNN                           58.7 ± 1.2%           62.7 ± 3.0%
Washington Post (WP)          55.6 ± 3.0%           60.7 ± 1.4%
BBC                           55.1 ± 3.1%           64.1 ± 2.7%
MarketWatch (MW)              56.5 ± 2.6%           64.2 ± 1.8%
Wall Street Journal (WSJ)     57.7 ± 3.7%           60.0 ± 4.3%
FoxNews (FN)                  53.2 ± 2.9%           61.9 ± 3.3%

Figure 3: Balanced Twitter dataset: percent of predicted labels from each class when fine-tuned with ten different combinations of news outlet texts

Figure 4: Unbalanced Twitter dataset: percent of predicted labels from each class when fine-tuned with ten different combinations of news outlet texts

Results provide evidence that the model is not robust to unbalanced datasets. When the Mix 1 and Mix 2 results are compared, the model always achieves better results for Mix 2 (Table 2), which has 54% neutral labels compared to 31.5% neutral labels in Mix 1. As evident from Figure 4, 80-90% of the predicted labels for Mix 2 are neutral. Therefore, the better results for Mix 2 are achieved because the algorithm over-predicts the most frequent (neutral) label in the unbalanced dataset (which contains 54% of examples of that class).

The classification accuracy difference between Mix 1 and Mix 2 is the largest (11.9%) when "left-biased news" is used for fine-tuning. In this case, the accuracy on both Mix 1 and Mix 2 decreases compared to the "no news" setting. However, outlet bias has more influence on the accuracy of Mix 1.

Figure 3 reveals that using "all news" data for fine-tuning achieves the best balance among predicted labels for Mix 1. However, almost half of the predicted labels are wrong, so accuracy is low.

The labeled Twitter data shows considerable diversity among posts with the "left" label. They often discuss only one particular issue and carry fewer hashtags that support the left political spectrum. Additionally, the diversity of people and entities mentioned is more prominent in the posts labeled "left" than in those labeled "right" (which mainly mention President Trump). Hence, the best performance for Mix 1 is achieved when fine-tuning with "CNN" data, because the model is trained to focus more on left-relevant contexts.

The next best results for Mix 1 are achieved when fine-tuning with news articles from The Wall Street Journal, because its articles often discuss both sides in detail (sometimes even in the same sentence). Hence, when the model is trained with data from this outlet, it understands the relevant phrases and predicts the "left" and "right" labels with higher accuracy. On the other hand, the "Wall Street Journal" fine-tuned experiment predicts the "right" label for "left"-labeled examples much more often than the other experiments.

The confusion matrices created for each experiment and Figure 4 reveal that, on Mix 2, the algorithm recognizes the right label more easily than the left label. The better recognition of the right label can be explained by the different writing style of the left-labeled tweets, which reflect a more diverse set of topics and entities, as discussed above. The best accuracy score for Mix 2 is achieved when "no news" data is used for the fine-tuning process. Most of the labels are neutral, and the news data is mainly left- or right-oriented/biased, so it influences the accuracy negatively.

As hypothesized, the results demonstrate that fine-tuning with biased news datasets can influence accuracy in contrasting ways. The differing influence of biased news is particularly visible in the results for Mix 1, where the difference between the best and the worst accuracy across fine-tuning settings is 9.5%. In Mix 2, this difference is also notable, at 7.2%. The influence of the bias is not uniform. While fine-tuning with "left-biased news" gives the worst result for Mix 1, its performance for Mix 2 is average compared to the other experiments. On the other hand, fine-tuning with "all news" gives the worst results for Mix 2 and average results for Mix 1.

The size of the fine-tuning data does not seem to influence the results. "Washington Post" has the largest number of words, but it achieves average results in both mixes. "CNN" is the smallest dataset, but it achieves the best result for Mix 1. It is interesting to notice that "all news" achieves worse results than "no news" fine-tuning for both Mix 1 and Mix 2, even though, in the literature, training with more data often contributes to better results. This suggests that the content (bias) of the fine-tuning dataset is more important than its size.

The accuracy behavior in many experiments requires further analysis in order to better understand the influence of fine-tuning text characteristics on performance. Additionally, the effect of non-relevant text on accuracy should be tested further, since its frequency is high in both the news and Twitter data. Since the results clearly show that this model is not robust to bias and noise, other novel methods should be tested similarly. It is essential to create unbalanced and biased datasets for fine-tuning and testing of future models in order to develop robust methods that would be beneficial in real-world applications.

5 Conclusion

In this work, we have shown that bias, noise, and text properties need to be accounted for when constructing data for fine-tuning language models. Text size does not seem to be an important dimension. We performed experiments with data collected from Twitter and six news outlets using the ULMFiT language model. Results show that the algorithm is not robust to noise in the data, to bias in the fine-tuning dataset, or to dataset imbalance.

While the conducted experiments show weaknesses of the existing system, further work is needed to better understand the relationship between the properties of fine-tuning data and specific tasks. Additionally, better models are required that are more robust to bias and noise, in order to solve challenging real-world problems.

6 Acknowledgements

This research was supported in part by NSF grant IIS-1842183.
References

[ACCF16] Orestes Appel, Francisco Chiclana, Jenny Carter, and Hamido Fujita. A hybrid approach to the sentiment analysis problem at the sentence level. Knowledge-Based Systems, 108:110–124, 2016.

[BM18] Marco Bastos and Dan Mercea. Parametrizing Brexit: mapping Twitter political space to parliamentary constituencies. Information, Communication & Society, 21(7):921–939, 2018.

[CSBL16] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781, 2016.

[CSPR16] Fabio Celli, Evgeny Stepanov, Massimo Poesio, and Giuseppe Riccardi. Predicting Brexit: Classifying agreement is better than sentiment and pollsters. In Proceedings of the Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media (PEOPLES), pages 110–118, 2016.

[CXHW17] Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221–230, 2017.

[Gly18] Carroll J. Glynn. Public opinion. Routledge, 2018.

[GPCR18] Aitor García-Pablos, Montse Cuadros, and German Rigau. W2VLDA: almost unsupervised system for aspect based sentiment analysis. Expert Systems with Applications, 91:127–137, 2018.

[HBK+17] Philip N. Howard, Gillian Bolsover, Bence Kollanyi, Samantha Bradshaw, and Lisa-Maria Neudert. Junk news and bots during the US election: What were Michigan voters sharing over Twitter. Computational Propaganda Research Project, Oxford Internet Institute, Data Memo, 1, 2017.

[HR18] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

[LQH16] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.

[MBXS17] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.

[MGB+18] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.

[MKS17] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[Mos13] Mohamed M. Mostafa. More than words: Social networks' text mining for consumer brand sentiments. Expert Systems with Applications, 40(10):4241–4251, 2013.

[MXBS16] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[PNI+18] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[PP10] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326, 2010.

[RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf, 2018.

[SKB12] Pawel Sobkowicz, Michael Kaschesky, and Guillaume Bouchard. Opinion mining in social media: Modeling, simulating, and forecasting political opinions in the web. Government Information Quarterly, 29(4):470–479, 2012.

[WD07] Duncan J. Watts and Peter Sheridan Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441–458, 2007.

[XC16] Yijun Xiao and Kyunghyun Cho. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.

[YYD+16] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.

[ZQZ+16] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.

[ZWWL18] Shunxiang Zhang, Zhongliang Wei, Yin Wang, and Tao Liao. Sentiment analysis of Chinese micro-blog text based on extended sentiment dictionary. Future Generation Computer Systems, 81:395–403, 2018.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.