Biased News Data Influence on Classifying Social Media Posts

Marija Stanojevic, Jumanah Alshehri, Eduard Dragut, Zoran Obradovic
Center for Data Analytics and Biomedical Informatics (DABI)
Temple University, Philadelphia, Pennsylvania, USA
{marija.stanojevic, jumanah.alshehri, edragut, zoran.obradovic}@temple.edu

Abstract

A common task among social scientists is to mine and interpret public opinion using social media data. Scientists tend to employ off-the-shelf state-of-the-art short-text classification models. Those algorithms, however, require a large amount of labeled data. Recent efforts aim to decrease the required number of labeled examples via self-supervised learning and fine-tuning. In this work, we explore the use of news data on a specific topic to fine-tune opinion mining models learned from social media data, such as Twitter. In particular, we investigate the influence of biased news data on models trained on Twitter data, considering both balanced and unbalanced cases. Results demonstrate that tuning with biased news data of different properties changes the classification accuracy by up to 9.5%. The experimental studies reveal that the characteristics of the tuning text, such as bias, vocabulary diversity, and writing style, are essential for the final classification results, while the size of the data is less consequential. Moreover, a state-of-the-art algorithm is not robust on an unbalanced Twitter dataset and over-predicts the most frequent label.

1 Introduction

In recent years, social media platforms have become leading channels for the exchange of knowledge, debates, and product or opinion advertising [PP10, WD07, Gly18, SKB12]. Social scientists routinely use data from social media platforms to survey public opinion on specific topics [Mos13, CSPR16, HBK+17, BM18], and computer scientists use the data to improve the performance of state-of-the-art natural language processing (NLP) algorithms [CXHW17, ACCF16, GPCR18, ZWWL18].

Social media data, while abundant, pose many challenges in usage: 1) user demographics are rarely available; 2) posts are short and sometimes hard to understand without context; and 3) it is challenging to label millions of posts manually in a short time. One may overcome the first challenge by selecting only information from users whose demographic information is available across multiple social platforms. However, this may bias the data. To solve the other two problems, we need systems that classify data into different opinion classes with limited human involvement.

Numerous algorithms have been proposed to cope with large amounts of short text [ZZL15, ZQZ+16, LQH16, XC16, CSBL16, YYD+16, MGB+18]. All these algorithms are supervised in nature and therefore require hundreds of thousands of labels in order to achieve adequate performance levels. In the last two years, algorithms such as CoVe [MBXS17], ELMo [PNI+18], ULMFiT [HR18], and OpenAI GPT [RNSS18] have been proposed to minimize the need for labeled data and to increase performance by cleverly exploiting text characteristics. Those methods start from word vectors pre-trained on general documents and fine-tune them on domain-specific documents via self-supervised learning. In Universal Language Model Fine-tuning for Text Classification (ULMFiT) [HR18], the self-supervised process predicts the next word based on the previous words in the context. After the fine-tuning step, we additionally train the model with a small number of manually labeled text instances (Figure 1).

Figure 1: ULMFiT model training-flow overview
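To make the self-supervised objective concrete, the toy sketch below builds (context, next-token) pairs from a single post, which is all the "labeling" that language-model fine-tuning requires. This is our simplified illustration, not the authors' code: ULMFiT uses a proper tokenizer and whole corpora, while here a whitespace tokenizer and an invented example sentence stand in.

```python
# Minimal sketch of the self-supervised signal behind ULMFiT-style language
# modeling: every token is predicted from the tokens preceding it, so no manual
# labels are needed. Whitespace tokenization and the example post are
# illustrative simplifications.

def next_word_examples(text):
    """Turn raw text into (context, next-token) training pairs."""
    tokens = text.lower().split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

if __name__ == "__main__":
    post = "remember to vote in the 2018 midterm elections"
    for context, target in next_word_examples(post):
        print(f"{' '.join(context):45s} -> {target}")
```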
Most of the datasets (e.g., AG News, DBPedia, Yahoo Answers) [ZZL15] used for testing text classification algorithms are balanced and, on average, contain much longer texts than social media posts. On the other hand, social media data retrieved in order to model opinion is usually unbalanced. The goal of this paper is to investigate the performance of the ULMFiT model in classifying social media posts under different settings of fine-tuning and labeled datasets. We test balanced and unbalanced labeled social media datasets and fine-tuning news texts with different characteristics (e.g., size, bias, writing style).

Experiments utilize Twitter data related to the 2018 USA midterm elections and news data from the 2016 USA elections. The news data is collected from six major outlets which are considered to have a bias towards the left or right political spectrum.[1] We test how fine-tuning with articles from different news outlets influences the accuracy of social media post classification. The hypothesis is that fine-tuning with appropriate topic-related news text can help improve classification, but that bias in the news articles can also hurt performance. We test the hypothesis on the ULMFiT algorithm, described in the next section.[2]

[1] Information about outlet bias is taken from: https://mediabiasfactcheck.com
[2] The latest text classification progress: http://nlpprogress.com/english/text_classification.html

2 Methods

The ULMFiT model [HR18] consists of three training components (Figure 2). Each component is based on the AWD-LSTM language model [MKS17] and consists of a word-embedding (input) layer, multiple LSTM layers, and a softmax layer used to predict the output. Experimental results in the literature show that multiple LSTM layers can learn more complex contexts [MBXS17, PNI+18, HR18] than single-LSTM-layer models.

Figure 2: ULMFiT model training details

In the first part of ULMFiT, word and context embeddings are learned from general texts (such as Wikipedia). In the second part, they are updated (fine-tuned) with topic-related data to learn domain-specific words and phrases. The third part is trained on labeled domain-specific examples so that it can predict labels for new examples. The output of each part is the input to the next step.

Even though the ULMFiT model is complex, it can be trained efficiently on GPUs when smartly implemented. Once trained, the first part of the model does not change, so we use pre-trained WT103 vectors to reduce training time.[3]

[3] WT103 word-vectors can be found here: http://files.fast.ai/models/wt103

In order to speed up fine-tuning (Figure 2b), we use different learning rates for each LSTM layer. The top layer, which calculates the softmax, has the largest learning rate, η_L. Learning rates for the remaining layers are set to η_{l−1} = η_l / 2.6 for l ∈ (1, L], as suggested in a prior study [HR18].
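The snippet below is a minimal PyTorch sketch of such discriminative learning rates (our illustration, not the authors' implementation): each layer of a toy embedding/LSTM/softmax stack is placed in its own optimizer parameter group, and the rate is divided by 2.6 at every step down from the top layer. The layer sizes and the top learning rate are placeholder values.

```python
# Sketch of discriminative (per-layer) learning rates: the top layer gets the
# largest rate and every lower layer gets the rate of the layer above divided
# by 2.6, following eta_{l-1} = eta_l / 2.6. Layer sizes and eta_top are
# illustrative, not the values used in the paper.
import torch
import torch.nn as nn

# Toy stand-in for the AWD-LSTM stack: embedding, three LSTM layers, softmax head.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=400)
lstm1 = nn.LSTM(input_size=400, hidden_size=1150, batch_first=True)
lstm2 = nn.LSTM(input_size=1150, hidden_size=1150, batch_first=True)
lstm3 = nn.LSTM(input_size=1150, hidden_size=400, batch_first=True)
head = nn.Linear(400, 10_000)

layers = [embedding, lstm1, lstm2, lstm3, head]  # ordered bottom -> top

eta_top = 0.01  # learning rate of the top (softmax) layer
param_groups = []
for depth, layer in enumerate(reversed(layers)):      # iterate top layer first
    param_groups.append({
        "params": layer.parameters(),
        "lr": eta_top / (2.6 ** depth),               # eta_{l-1} = eta_l / 2.6
    })

optimizer = torch.optim.Adam(param_groups)
for group in optimizer.param_groups:
    print(f"lr = {group['lr']:.6f}")
```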
Instead of a constant learning rate, slanted triangular learning rates are used for every layer to improve the accuracy of the model [HR18]. First, the learning rate increases sharply and linearly so that the model can learn quickly from the first examples. Once the learning rate reaches η_L, it declines slowly and linearly, as shown in the top-right corner of Figure 2.

In the third step (Figure 2c), layers are trained gradually. First, only the top layer is trained with labeled data for one epoch while the other layers are frozen. In each new epoch, the next frozen layer from the top is added to the training.
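The sketch below (again our illustration, with hyperparameter values taken from common ULMFiT practice rather than from this paper) shows the two mechanics just described: the slanted triangular schedule, which rises linearly to the peak rate over a short fraction of training and then decays linearly, and gradual unfreezing, which makes one additional layer trainable per epoch, starting from the top.

```python
# Sketch of the slanted triangular learning-rate schedule and gradual
# unfreezing described in [HR18]; cut_frac and ratio are commonly used
# defaults, not values reported in this paper.

def slanted_triangular_lr(t, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at step t: sharp linear warm-up, then slow linear decay."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                      # rising phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # slowly declining phase
    return lr_max * (1 + p * (ratio - 1)) / ratio


def gradually_unfreeze(layers, epoch):
    """layers are torch.nn modules ordered bottom -> top; in epoch 0 only the
    top layer is trainable, and each later epoch unfreezes one more layer."""
    for depth, layer in enumerate(reversed(layers)):
        trainable = depth <= epoch
        for param in layer.parameters():
            param.requires_grad_(trainable)


if __name__ == "__main__":
    for t in (0, 5, 10, 50, 99):
        print(f"step {t:3d}: lr = {slanted_triangular_lr(t, 100, lr_max=0.01):.5f}")
```

In practice the schedule is applied on top of the per-layer rates sketched earlier, so each layer follows the same triangular shape scaled by its own maximum rate.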
3 Experiments

Experiments are conducted using Twitter data on the 2018 USA midterm elections and news data from the 2016 USA elections.

Twitter data is collected by searching for posts published between November 4th and 7th, 2018 that carry one of the hashtags "#vote", "#trump", "#election", "#midtermelection", "#democrats", "#republicans", or "#2018midterms". In total, we accrue 936,462 tweets. Most of the posts are retweets, which appear multiple times in the corpus. After retweet removal, 244,320 distinct posts remain, and we pre-process their text by removing all non-alphanumeric characters.

Out of those posts, we label 1,526 examples with 0, 1, or 2. Label 0 is assigned to examples that support or promote the left political spectrum or denounce the right point of view. Label 1 is given to politically neutral posts (e.g., posts that encourage voting). Label 2 is assigned to examples that support or advertise the right political spectrum or condemn the left point of view. We discard 500 examples (∼25% of the annotated posts) because they are unrelated to the elections.

News data is collected from six outlets that are perceived to have different political partisanship, ranging from left-oriented to right-oriented, based on the Media Bias/Fact Check website (Table 1). Articles published between October 2015 and May 2017 that contain the words "election", "ballot", "republican", "GOP", or "democrat" are selected. The news articles differ substantially in writing style, content diversity, bias, number of articles, and number of words (Table 1). As with the tweets, the news articles do not always discuss the U.S. elections; sometimes they debate Brexit or elections in France and other countries worldwide. In pre-processing, we remove all non-alphanumeric characters from the news articles.

Table 1: Outlets

Outlet                      Bias            #Words
CNN News (CNN)              left            426,778
Washington Post (WP)        left-center     9,229,176
BBC News (BBC)              neutral-left    1,247,437
MarketWatch (MW)            neutral-right   1,505,107
Wall Street Journal (WSJ)   right-center    547,548
FoxNews (FN)                right           3,082,912

Experiment settings. We use the pre-trained WT103 token vectors in the first ULMFiT step. WT103 contains 103 million Wikipedia tokens for training, 217K tokens for validation, and 245K tokens for testing [MXBS16]. Our system is trained using the architecture in Figure 2a. The vocabulary has 267K unique tokens. In this paper, "word" and "token" are used interchangeably.

For the fine-tuning step, we explore ten different settings: 1) "all news" text with data from all outlets + tweet text; 2) only the tweets; 3) text from "left-biased" outlets + tweet text; 4) text from "right-biased" outlets + tweet text. The remaining six experiments each combine text from a single outlet with the tweet text. We randomly permute the examples in a fine-tuning dataset before usage.

In the third step, experiments test two settings of labeled Twitter data. Mix 1 (balanced mix) contains 380 examples with label 0 (left), 323 examples with label 1 (neutral), and 323 examples with label 2 (right). Mix 2 (unbalanced mix) contains 380 examples with label 0 (left), 823 examples with label 1 (neutral), and 323 examples with label 2 (right). We randomly split the labeled data into three disjoint parts: test (200 examples), validation (200 examples), and training (626 examples in Mix 1 and 1,126 examples in Mix 2). Each experiment is repeated four times, and the mean and standard deviation of the accuracy are reported for each of the ten settings.
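The sketch below shows how the two labeled mixes and the 200/200/remainder split can be reproduced; the class counts come from the paper, but the placeholder texts and the splitting code are our own, since the exact implementation is not published.

```python
# Sketch (our illustration, not the authors' released code) of building the two
# labeled mixes and the random test/validation/train split described above.
import random

def make_split(examples, seed=0):
    """examples: list of (text, label) pairs; returns test, validation, train."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:200], shuffled[200:400], shuffled[400:]

# Class counts from the paper: label 0 = left, 1 = neutral, 2 = right.
MIX_1_COUNTS = {0: 380, 1: 323, 2: 323}   # balanced mix
MIX_2_COUNTS = {0: 380, 1: 823, 2: 323}   # unbalanced mix

def fake_mix(counts):
    """Stand-in dataset with placeholder texts, used only to show split sizes."""
    return [(f"tweet_{label}_{i}", label) for label, n in counts.items() for i in range(n)]

if __name__ == "__main__":
    for name, counts in (("Mix 1", MIX_1_COUNTS), ("Mix 2", MIX_2_COUNTS)):
        test, valid, train = make_split(fake_mix(counts))
        print(name, "test:", len(test), "valid:", len(valid), "train:", len(train))
```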
We do not clean the Twitter and news data of non-relevant examples, in order to emulate the real-world situation. The data retrieval process is intentionally simple to mirror the information extraction process often used in research papers [Mos13, CSPR16, HBK+17, BM18]. These experiments therefore test the robustness of the model to bias and noise in the data and to unbalanced classes.

4 Results and discussion

We repeat each experiment four times and report the mean and standard deviation of the accuracy in Table 2. The high standard deviation (1.2-5.3%) indicates the model's sensitivity to the order of examples in the fine-tuning data and a need for more labeled examples.

Table 2: Classification results

News sources included         Mix 1                 Mix 2
(Left : Neutral : Right)      (380 : 323 : 323)     (380 : 823 : 323)
All news                      53.2 ± 3.0%           59.4 ± 3.7%
No news                       56.0 ± 5.3%           66.6 ± 2.5%
Left-biased (CNN+WP+BBC)      49.2 ± 2.9%           61.1 ± 3.3%
Right-biased (MW+WSJ+FN)      51.7 ± 3.8%           63.0 ± 3.2%
CNN                           58.7 ± 1.2%           62.7 ± 3.0%
Washington Post (WP)          55.6 ± 3.0%           60.7 ± 1.4%
BBC                           55.1 ± 3.1%           64.1 ± 2.7%
MarketWatch (MW)              56.5 ± 2.6%           64.2 ± 1.8%
Wall Street Journal (WSJ)     57.7 ± 3.7%           60.0 ± 4.3%
FoxNews (FN)                  53.2 ± 2.9%           61.9 ± 3.3%

Figure 3: Balanced Twitter dataset: percent of predicted labels from each class when fine-tuned with ten different combinations of news outlet texts

Figure 4: Unbalanced Twitter dataset: percent of predicted labels from each class when fine-tuned with ten different combinations of news outlet texts

Results provide evidence that the model is not robust to unbalanced datasets. When the Mix 1 and Mix 2 results are compared, the model always achieves better results for Mix 2 (Table 2), which has 54% neutral labels compared to 31.5% neutral labels in Mix 1. As evident from Figure 4, 80-90% of the predicted labels for Mix 2 are neutral. Therefore, the better results for Mix 2 are achieved because the algorithm over-predicts the most frequent (neutral) label in the unbalanced dataset (which contains 54% of examples of that class).

The classification accuracy difference between Mix 1 and Mix 2 is the largest (11.9%) when "left-biased news" is used for fine-tuning. In this case, the accuracy on both Mix 1 and Mix 2 decreases compared to the "no news" setting. However, outlet bias has more influence on the accuracy of Mix 1.

Figure 3 reveals that using "all news" data for fine-tuning achieves the best balance among predicted labels for Mix 1. However, almost half of the predicted labels are wrong, so accuracy is low.

The labeled Twitter data shows considerable diversity among posts with the "left" label. They often discuss only one particular issue and carry fewer hashtags that support the left political spectrum. Additionally, the diversity of people and entities mentioned is more prominent in the posts labeled "left" than in those labeled "right" (which mainly mention President Trump). Hence, the best performance for Mix 1 is achieved when fine-tuning with "CNN" data, because the model is trained to focus more on left-relevant contexts.

The next best results for Mix 1 are achieved when fine-tuning with news articles from The Wall Street Journal, because its articles often discuss both sides in detail (sometimes even in the same sentence). Hence, when the model is trained with data from this outlet, it understands the relevant phrases and predicts the "left" and "right" labels with higher accuracy. On the other hand, the "Wall Street Journal" fine-tuned experiment predicts the "right" label for "left"-labeled examples much more often than the other experiments.

The confusion matrices created for each experiment and Figure 4 reveal that, on Mix 2, the algorithm recognizes the right label more easily than the left label. The better recognition of the right label can be explained by the different writing style of the left-labeled tweets, which reflect a more diverse set of topics and entities, as discussed above. The best accuracy score for Mix 2 is achieved when "no news" data is used for the fine-tuning process. Most of the labels are neutral, and the news data is mainly left- or right-oriented/biased, so it influences the accuracy negatively.

As hypothesized, the results demonstrate that fine-tuning with biased news datasets can influence accuracy in contrasting ways. The differing influence of biased news is particularly visible in the results for Mix 1, where the difference between the best and the worst accuracy across fine-tuning settings is 9.5%. In Mix 2, this difference is also notable, at 7.2%. The influence of the bias is not uniform. While fine-tuning with "left-biased news" gives the worst result for Mix 1, its performance for Mix 2 is average compared to the other experiments. On the other hand, fine-tuning with "all news" gives the worst results for Mix 2 and average results for Mix 1.

The size of the fine-tuning data does not seem to influence the results. "Washington Post" has the largest number of words, but it achieves average results in both mixes. "CNN" is the smallest dataset, but it achieves the best result for Mix 1. It is interesting to notice that "all news" achieves worse results than "no news" fine-tuning for both Mix 1 and Mix 2, even though, in the literature, training with more data often contributes to better results. This suggests that the content (bias) of the fine-tuning dataset is more important than its size.

The accuracy behavior in many experiments requires further analysis in order to better understand the influence of fine-tuning text characteristics on performance. Additionally, the effect of non-relevant text on accuracy should be tested further, since its frequency is high in both the news and Twitter data. Since the results clearly show that this model is not robust to bias and noise, other novel methods should be tested similarly. It is essential to create unbalanced and biased datasets for fine-tuning and testing of future models in order to develop robust methods that would be beneficial in real-world applications.

5 Conclusion

In this work, we have shown that bias, noise, and text properties need to be accounted for when constructing data for fine-tuning language models. Text size does not seem to be an important dimension. We performed experiments with data collected from Twitter and six news outlets using the ULMFiT language model. Results show that the algorithm is not robust to noise in the data, to bias in the fine-tuning dataset, or to dataset imbalance.

While the conducted experiments show weaknesses of the existing system, further work is needed to better understand the relationship between the properties of fine-tuning data and specific tasks. Additionally, better models are required that are more robust to bias and noise, in order to solve challenging real-world problems.

6 Acknowledgements

This research was supported in part by NSF grant IIS-1842183.
References

[ACCF16] Orestes Appel, Francisco Chiclana, Jenny Carter, and Hamido Fujita. A hybrid approach to the sentiment analysis problem at the sentence level. Knowledge-Based Systems, 108:110–124, 2016.

[BM18] Marco Bastos and Dan Mercea. Parametrizing Brexit: mapping Twitter political space to parliamentary constituencies. Information, Communication & Society, 21(7):921–939, 2018.

[CSBL16] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781, 2016.

[CSPR16] Fabio Celli, Evgeny Stepanov, Massimo Poesio, and Giuseppe Riccardi. Predicting Brexit: Classifying agreement is better than sentiment and pollsters. In Proceedings of the Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media (PEOPLES), pages 110–118, 2016.

[CXHW17] Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221–230, 2017.

[Gly18] Carroll J. Glynn. Public opinion. Routledge, 2018.

[GPCR18] Aitor García-Pablos, Montse Cuadros, and German Rigau. W2VLDA: almost unsupervised system for aspect based sentiment analysis. Expert Systems with Applications, 91:127–137, 2018.

[HBK+17] Philip N. Howard, Gillian Bolsover, Bence Kollanyi, Samantha Bradshaw, and Lisa-Maria Neudert. Junk news and bots during the US election: What were Michigan voters sharing over Twitter. Computational Propaganda Research Project, Oxford Internet Institute, Data Memo, 1, 2017.

[HR18] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

[LQH16] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.

[MBXS17] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.

[MGB+18] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.

[MKS17] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[Mos13] Mohamed M. Mostafa. More than words: Social networks' text mining for consumer brand sentiments. Expert Systems with Applications, 40(10):4241–4251, 2013.

[MXBS16] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[PNI+18] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[PP10] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326, 2010.

[RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf, 2018.

[SKB12] Pawel Sobkowicz, Michael Kaschesky, and Guillaume Bouchard. Opinion mining in social media: Modeling, simulating, and forecasting political opinions in the web. Government Information Quarterly, 29(4):470–479, 2012.

[WD07] Duncan J. Watts and Peter Sheridan Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441–458, 2007.

[XC16] Yijun Xiao and Kyunghyun Cho. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.

[YYD+16] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.

[ZQZ+16] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.

[ZWWL18] Shunxiang Zhang, Zhongliang Wei, Yin Wang, and Tao Liao. Sentiment analysis of Chinese micro-blog text based on extended sentiment dictionary. Future Generation Computer Systems, 81:395–403, 2018.

[ZZL15] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.