=Paper=
{{Paper
|id=Vol-2079/paper5
|storemode=property
|title=Social Media and Information Consumption Diversity
|pdfUrl=https://ceur-ws.org/Vol-2079/paper5.pdf
|volume=Vol-2079
|authors=José Devezas,Sérgio Nunes
|dblpUrl=https://dblp.org/rec/conf/ecir/DevezasN18
}}
==Social Media and Information Consumption Diversity==
José Devezas, Sérgio Nunes

INESC TEC and Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

{jld,ssn}@fe.up.pt

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

===Abstract===
Social media platforms are having a profound impact on the so-called information ecosystem, specifically on how information is produced, distributed and consumed. Social media in particular has contributed to the rise of user generated content and consequently to a greater diversity in online content. On the other hand, social media networks, such as Twitter or Facebook, have become information management tools that allow users to set up and configure information sources according to their particular interests. A Twitter user can handpick the sources they wish to follow, thus creating a custom information channel. However, does this opportunity to create personalized information channels effectively result in different consumption profiles? Is the information consumed by users through social media networks distinct from the information consumed through traditional mainstream media? In this work, we set out to investigate this question using Twitter as a case study. We prepare two samples of users, one based on a uniform random selection of user IDs, and another one based on a selection of mainstream media followers. We analyze the home timelines of the users in each sample, focusing on characterizing information consumption habits. We find that information consumption volume is higher, while diversity is consistently lower, for mainstream media followers when compared to random users. When analyzing daily behavior, however, the two samples move slightly closer together, while clearly maintaining a lower diversity for mainstream media followers and a higher diversity for random users.

===1 Introduction===
Social media has become a part of our modern lives and a central service for information consumption, covering a wide range of topics, from personal events to worldwide news. Several studies [CHBG10, KWM11, MJA+11, LKSM14, CSR14] have focused on the study of social media through the characterization of users, usage patterns and content production. In this work, we take advantage of Twitter to study content consumption, giving particular attention to the characterization of the consumption patterns of news followers. As an information diffusion service, Twitter is frequently used for news broadcasting, either by citing a mainstream media news article, or even by directly serving as a communication channel to broadcast the news events themselves. Some studies have compared the content generated in Twitter with the content generated by mainstream media. These studies frequently focus on a collection of tweets, usually retrieved from the Stream API, and a collection of news articles from well-known newspapers, for a common period of time. However, there are fewer studies that focus on analyzing the content consumed by each Twitter user on their own timeline and, to our knowledge, no study that distinguishes the content followed by Twitter users interested in mainstream media from the content followed by the majority of Twitter users.

In this work, we studied the home timelines of a collection of Twitter users, in order to understand the type of content that users follow on Twitter. In particular, we were interested in comparing the general Twitter population with a specific group of mainstream media consumers. Our goal was to investigate to what degree the timeline of each Twitter user, i.e. the information to which the user is exposed, differs from the timelines of other users. In other words, we wanted to understand whether the experience of each user is unique or, on the contrary, similar to that of other users. To achieve this goal, we characterized the anatomy of each individual timeline, presenting aggregated results per sample and studying the diversity of consumed information, overall as well as over time.

===2 Reference Work===
Bache et al. [BNS13] proposed a text-based framework for quantifying the diversity of documents based on their terms. Their approach was based on the application of Latent Dirichlet Allocation [BNJ03], to build a topic model for a given corpus, and the computation of the distance matrix between pairs of topics, using measurements such as topic co-occurrence and topic-word similarity. They estimated the diversity for each document, in relation to the corpus, by combining the distance matrix with the topic distribution for the document.

White and Jose [WJ04] evaluated several measurements of topic similarity, grouping them into association (Dice, Jaccard, Cosine, Overlap), correlation (Spearman, Kendall, Pearson), and distance (Euclidean, L1 norm, Kullback-Leibler). For assessment, they used topics 101-150 from the TREC and the San Jose Mercury News 1991 collection. They pre-selected 10 topics, ensuring a variable degree of overlap between the most relevant documents for each topic, and asked a group of 76 subjects to evaluate the similarity between each pair of topics using a 5-point scale (from highly dissimilar to highly similar). While the evaluation was done for only ten topics, according to their study, the most useful measurement group was the correlation group, followed by the association group and, only then, the distance group.

Zhao et al. [ZJW+11] compared Twitter and mainstream media using topic models. They used a sample of the Edinburgh Twitter Corpus [POL10], originally collected from the Stream API, and crawled news articles from the New York Times using its search function. Both datasets comprised documents for the timespan between November 11, 2009, and February 1, 2010. They used Latent Dirichlet Allocation to directly extract topics from the New York Times dataset but, given the small size of tweets, they proposed a custom Twitter-LDA algorithm for topic detection in the Twitter dataset. In order to compare Twitter with mainstream media, they labeled detected topics using the categories provided by the New York Times, which they had to manually assign to their Twitter dataset. Moreover, they used three topic types to distinguish topics: event-oriented, entity-oriented, and long-standing. By looking at the distributions of topic categories and types, they discovered that Twitter provides more entity-oriented topics with low coverage on mainstream media, and that, although Twitter shows a low interest in world news, it helped spread news of important world events. The study we present here is similar in the sense that we also focus on understanding the position of mainstream media regarding Twitter, but it is different in the sense that we keep our focus on Twitter, distinguishing between the home timelines of random users and the home timelines of mainstream media followers. Our study is centered around the individual (per user) consumption of content, for two distinct samples of users, as opposed to simply comparing the overall topics present in social media versus mainstream media. In particular, we are interested in studying the differences between the content that Twitter users are exposed to in their personal timelines.

There are multiple metrics that can serve as a diversity index [Jos06, Table 1], including True Diversity, Richness, Shannon Index, Simpson Index and Berger-Parker Index. Most diversity metrics are transformations of the effective number of types and have a particular interpretation dependent on the context of application. Our approach to studying diversity is based on the direct comparison of home timelines from individual users from two samples: one collected randomly and another one collected based on the preference to follow mainstream media accounts (i.e. users that share a common interest). We then analyze the cosine distances between all pairs of accounts within each sample, in order to quantify divergent behavior and thus estimate diversity.

===3 Data Collection===
In order to analyze the differences between the content that random users and mainstream media followers consume on their Twitter home timelines, we needed to indirectly obtain a sample of user home timelines. Given that Twitter does not provide this feature directly through its API, our approach consisted of the following five steps:

# Collect a sample of 20 users by generating random user IDs between 1 and the largest known user ID, from a recently created user.
# Collect a sample of 20 users that follow at least 3 UK news accounts from the following list: @BBCNews; @guardian; @Telegraph; @Independent; @MailOnline; @DailyMirror; @TheSun; @daily_express; @metrouk; @daily_star; @standardnews.
# For each collected user, fetch their followed accounts.
# At the same time, for each followed account, fetch and store all their tweets for the past 14 days.
# Locally, for each collected user, retrieve its stored followed account timelines, ordered by decreasing date, thus rebuilding the home timelines per user.

Each collected user, described in steps 1 and 2, was subject to a set of criteria to ensure a minimum level of expected activity and connectivity of the accounts (a basic check to discard inactive users):

* The user must have created at least one tweet in the last three months.
* The user must have at least three followers.
* The user must have created at least five tweets since the creation of the account.

The data was stored in an SQLite database. In order to define and describe each user sample, we used a "user_samples" table where we stored groups of user IDs, identified by a common sample ID. Each "user_sample" entry also contained a textual description detailing the data collection approach, as well as the user selection criteria (e.g., "Random users, generated by a random uniform sampling of Twitter user IDs between 1 and 3954358701, restricting language to 'en', last tweet date to 2015-07-15 16:45:43, follower count to 3 and status count to 5.").

In this paper, we characterize and compare the timelines for two user samples: "Sample Random 20", which represents the baseline as a collection of random Twitter users, and "Sample UK News Followers 20", which represents a particular group of users who have shown a general interest in mainstream media by following well-known UK news accounts.

===4 Data Characterization===
Overall, our collection contains 5,287,221 distinct tweets. However, as different accounts frequently have followed accounts in common, the timelines overlap, resulting in 7,758,779 analyzable tweets when looking at individual home timelines. "Sample Random 20" contains 947,068 distinct tweets, resulting in 1,080,789 (13.93%) of the overall analyzable tweets. "Sample UK News Followers 20" contains 4,685,800 distinct tweets, resulting in 6,677,990 (86.07%) of the overall analyzable tweets. Distinct tweets in "Sample Random 20" and "Sample UK News Followers 20" intersect, resulting in 345,647 common tweets. Users from "Sample Random 20" follow a total of 11,807 distinct users; on average, each user follows 621.42 users. Users from "Sample UK News Followers 20" follow a total of 22,082 distinct users; on average, each user follows 1,104.10 users.

The tweets for each user's followed accounts were collected for a period of 14 days, with slightly different start dates, resulting in an overall larger period of 55 days, from Jul 19, 2016, to Sep 12, 2016. The timespan for the home timelines of the 40 users in both samples only overlapped for a period of 13 consecutive days, from Jul 20, 2016, to Aug 2, 2016. We analyzed the average number of tweets over time, per day and per hour, for each sample. While "Sample Random 20" is moderately stable per day, with a coefficient of variation of 29.1%, "Sample UK News Followers 20" shows a more evident growth in the number of tweets, peaking on Jul 29 and having a coefficient of variation of 42.0%. Regarding the average number of tweets per hour, worldwide, the maximum number of tweets was generated at 20:00 UTC, Jul 23, 2016, for "Sample Random 20" and at 16:00 UTC, Aug 1, 2016, for "Sample UK News Followers 20", with coefficients of variation of 33.3% and 42.9%, respectively.

===5 Information Consumption===
When social media paved the way for pervasive communication, people became both producers and consumers. This introduced a shift in habits with potential implications for the quality and diversity of the consumed information. In order to better understand the impact of this change, we set out to study how diverse timelines are, by focusing on what users consume through their followed accounts. Our goal was to answer the following questions: Do random users and mainstream media followers have access to the same information through different channels? Or do the mainstream media still play a fundamental role in information diffusion that cannot be replaced by regular Twitter users and "word-of-mouth"?

====5.1 Measuring Diversity====
We aimed at characterizing and understanding the differences between the content consumed by random users and the content consumed by users with a particular interest in mainstream media. Our approach consisted of creating a user profile based on the tweets received in a user's timeline. Each tweet was preprocessed by removing emoji, links, mentions, 'RT' and punctuation, and by normalizing spacing, through the conversion of multiple spaces, tabs and new lines to a single space and the trimming of the text. We then created a document per user, containing a concatenation of all preprocessed tweets that appeared in the user's home timeline. Each document was converted to lower case and tokenized into unigrams, removing stopwords from several languages (see footnote 1) and obtaining a document-term matrix, with the absolute term frequencies, per sample. Sparse terms were then pruned, ensuring a maximum sparsity of 0.996. This means that rare terms with more than 99.6% zeros, which were less useful in distinguishing user profiles, were simply discarded.

Footnote 1: We considered English, French, Spanish, Portuguese, Arabic, Russian, Greek and Hindi, but also typical expressions used in Twitter, like 'via' or 'vs'.

The resulting document-term matrix for "Sample Random 20" contained 19 documents and 228,165 terms (meaning that one of the users received no tweets during the time span of the collection) and the document-term matrix for "Sample UK News Followers 20" contained 20 documents and 389,831 terms. In order to understand how diverse each timeline was, within either sample, we computed the cosine distance from each timeline to all others in the same sample. Timelines that are highly diverse will consistently have a high distance to most of the other timelines. Similarly, a sample will contain highly diverse timelines if the overall distances between all timelines are high, that is, if timelines within a given sample considerably diverge in consumed content.

Figure 1: Cosine distances per sample, for all pairs of timelines. (Box plot; vertical axis: cosine distance, from 0.00 to 1.00; horizontal axis: "Sample Random 20" and "Sample UK News Followers 20".)

Figure 1 shows the box plot of the cosine distances between all pairs of timelines for each sample. As we can see, in particular through the median, "Sample Random 20" contains timelines that are more divergent among themselves (median cosine distance is 0.87), while "Sample UK News Followers 20" contains timelines that are much less divergent among themselves (median cosine distance is 0.33). We can say that mainstream media followers have less diverse information consumption habits when compared to a random sample of users.

=====5.1.1 Diversity over Time=====
We used a similar approach to study diversity over time, but instead of using a single user profile per timeline, we created a document per day for each user. This meant slicing the two original samples into 14 smaller parts, each part corresponding to one day, and repeating the study for each day.

Figure 2: Cosine distances per sample, for all pairs of timelines, per day. The lines correspond to a locally weighted scatterplot smoothing (or LOESS, from LOcal regrESSion); they depict overall diversity per sample. (Vertical axis: cosine distance, from 0.00 to 1.00; horizontal axis: time, from Jul 20 to Aug 02; one series per sample.)

Figure 2 depicts the dispersion of cosine distances between all pairs of timelines, per sample, over time. The daily behavior is consistent with the aggregated overall behavior, despite resulting in a slightly higher median cosine distance overall. This means that information consumption habits of random users are more diverse than those of mainstream media followers, but also that information consumption diversity for random users is lower per day than overall for the 14 days and, on the other hand, for mainstream media followers, it is higher per day than overall. This is to be expected, as the number of topics discussed in a single day is intuitively smaller than the number discussed over the course of two weeks.

===6 Conclusions===
We have provided a consistent methodology to study the anatomy of a sample of Twitter timelines, focusing on content production and consumption, as well as on measuring overall and daily diversity. We studied the home timelines of two user samples: "Sample Random 20", a random selection of users based on their numeric ID, and "Sample UK News Followers 20", a selection of users that followed at least 3 out of 11 mainstream UK newspaper accounts.

We found that mainstream media followers consume a larger volume of information than random users. We analyzed the overall and the daily diversity over the course of two weeks, based on the cosine distances between all pairs of timelines, per sample. Both the overall and the daily diversity were consistently lower for the timelines of mainstream media followers, when compared to the timelines of random users. Interestingly, when moving from the overall two-week aggregations to the daily aggregations, the diversities of the two samples move slightly closer together, but still result in a lower diversity within mainstream media followers and a higher diversity within random users.

Overall, we can say that, when compared to random users, mainstream media followers consume a narrower range of content, covering a smaller number of topics, with a higher production volume. This can be explained by the fact that users in this sample share a common interest (i.e. UK news), as opposed to the users in the random sample, who have no common characteristic. As expected, mainstream media followers consume a less diverse variety of content. This diversity is higher when we look at individual days, probably representing the coverage of multiple topics throughout a day, but lower when we look at the two-week period, probably representing the convergence of topics.

In the future, we would like to analyze a larger sample of timelines, and also explore the diversity within topic-based communities, such as those focused on a given hashtag or those that share a geographical context.

===7 Acknowledgments===
José Devezas is supported by research grant PD/BD/128160/2016, provided by the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT). This work is partially funded by FourEyes, a Research Line within project "TEC4Growth – Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020", financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

===References===
* [BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
* [BNS13] Kevin Bache, David Newman, and Padhraic Smyth. Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '13, page 23, 2013.
* [CHBG10] Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, and Krishna P. Gummadi. Measuring user influence in twitter: The million follower fallacy. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM 2010), 2010.
* [CSR14] Tiago Cunha, Carlos Soares, and Eduarda Mendes Rodrigues. Tweeprofiles: detection of spatio-temporal patterns on twitter. In International Conference on Advanced Data Mining and Applications, pages 123–136. Springer International Publishing, 2014.
* [Jos06] Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006.
* [KWM11] Efthymios Kouloumpis, Theresa Wilson, and Johanna D. Moore. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), pages 538–541, Barcelona, Catalonia, Spain, July 2011. AAAI Press.
* [LKSM14] Yabing Liu, Chloe Kliman-Silver, and Alan Mislove. The tweets they are a-changin': Evolution of Twitter users and behavior. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM 2014), Ann Arbor, MI, June 2014.
* [MJA+11] Alan Mislove, Sune Lehmann Jørgensen, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist. Understanding the demographics of twitter users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), pages 554–557. AAAI Press, 2011.
* [POL10] Saša Petrović, Miles Osborne, and Victor Lavrenko. The Edinburgh Twitter Corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 25–26, 2010.
* [WJ04] Ryen W White and Joemon M Jose. A study of topic similarity measures. In Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR '04, page 520, 2004.
* [ZJW+11] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing Twitter and Traditional Media using Topic Models. In Advances in Information Retrieval, pages 338–349. Springer Berlin Heidelberg, 2011.
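
As an illustration of the diversity measurement described in Section 5.1 (tweet preprocessing, one term-frequency document per user, pruning of sparse terms, and cosine distances between all pairs of timelines), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation, whose tooling the paper does not name. The function names, the regular expressions, and the tiny stopword set are assumptions made for illustration only; the paper's stopword lists cover several languages, and discarding empty documents is an assumption based on the 19-document matrix reported for "Sample Random 20".

```python
import math
import re
from collections import Counter
from itertools import combinations
from statistics import median

# Hypothetical, deliberately tiny stopword set (the paper uses multilingual lists).
STOPWORDS = {"the", "a", "of", "in", "via", "vs"}


def preprocess(tweet):
    """Remove links, mentions, 'RT', punctuation and emoji; normalize spacing."""
    text = re.sub(r"https?://\S+", " ", tweet)  # links
    text = re.sub(r"@\w+", " ", text)           # mentions
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation and emoji symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace and trim


def term_frequencies(timeline):
    """Build one document per user and count lower-cased unigrams."""
    document = " ".join(preprocess(t) for t in timeline).lower()
    return Counter(tok for tok in document.split() if tok not in STOPWORDS)


def prune_sparse_terms(docs, max_sparsity=0.996):
    """Drop terms that are absent from more than max_sparsity of the documents."""
    n = len(docs)
    doc_freq = Counter(term for d in docs for term in d)
    keep = {t for t, df in doc_freq.items() if 1 - df / n <= max_sparsity}
    return [Counter({t: c for t, c in d.items() if t in keep}) for d in docs]


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def pairwise_distances(timelines):
    """Cosine distances between all pairs of non-empty user documents."""
    docs = [term_frequencies(tl) for tl in timelines]
    docs = prune_sparse_terms([d for d in docs if d])  # skip users with no text
    return [cosine_distance(a, b) for a, b in combinations(docs, 2)]


if __name__ == "__main__":
    # Toy timelines, invented for this example only.
    sample = [
        ["RT @bbc: Election results https://t.co/x", "Election news!"],
        ["Election   results\tvia @guardian"],
        ["Football match tonight"],
    ]
    print("median cosine distance:", median(pairwise_distances(sample)))
```

Under these assumptions, a higher median over the pairwise distances of a sample corresponds to more divergent timelines, which is how the medians of Figure 1 are read. Applying the same functions to per-day slices of each timeline would mirror the daily analysis of Section 5.1.1.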