=Paper= {{Paper |id=Vol-2079/paper5 |storemode=property |title=Social Media and Information Consumption Diversity |pdfUrl=https://ceur-ws.org/Vol-2079/paper5.pdf |volume=Vol-2079 |authors=José Devezas,Sérgio Nunes |dblpUrl=https://dblp.org/rec/conf/ecir/DevezasN18 }} ==Social Media and Information Consumption Diversity== https://ceur-ws.org/Vol-2079/paper5.pdf
Social Media and Information Consumption Diversity

José Devezas, Sérgio Nunes
INESC TEC and Faculty of Engineering, University of Porto
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
{jld,ssn}@fe.up.pt

Abstract

Social media platforms are having a profound impact on the so-called information ecosystem, specifically on how information is produced, distributed and consumed. Social media in particular has contributed to the rise of user generated content and consequently to a greater diversity in online content. On the other hand, social media networks, such as Twitter or Facebook, have become information management tools that allow users to set up and configure information sources according to their particular interests. A Twitter user can handpick the sources they wish to follow, thus creating a custom information channel. However, does this opportunity to create personalized information channels effectively result in different consumption profiles? Is the information consumed by users through social media networks distinct from the information consumed through traditional mainstream media? In this work, we set out to investigate this question using Twitter as a case study. We prepare two samples of users, one based on a uniform random selection of user IDs, and another one based on a selection of mainstream media followers. We analyze the home timelines of the users in each sample, focusing on characterizing information consumption habits. We find that information consumption volume is higher, while diversity is consistently lower, for mainstream media followers when compared to random users. When analyzing daily behavior, however, the two samples move slightly closer together, while clearly maintaining a lower diversity for mainstream media followers and a higher diversity for random users.

Copyright © 2018 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR’18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

1   Introduction

Social media has become a part of our modern lives and a central service for information consumption, covering a wide range of topics, from personal events to worldwide news. Several studies [CHBG10, KWM11, MJA+11, LKSM14, CSR14] have focused on the study of social media through the characterization of users, usage patterns and content production. In this work, we take advantage of Twitter to study content consumption, giving particular attention to the characterization of the consumption patterns of news followers. As an information diffusion service, Twitter is frequently used for news broadcasting, either by citing a mainstream media news article, or even by directly serving as a communication channel to broadcast the news events themselves. Some studies have compared the content generated in Twitter with the content generated by mainstream media. These studies frequently focus on a collection of tweets, usually retrieved from the Stream API, and a collection of news articles from well known newspapers, for a common period of time. However, there are fewer studies that focus on analyzing the content consumed by each Twitter user on their own timeline and, to our knowledge, no study that distinguishes the content followed by Twitter users interested in mainstream media from the content followed by the majority of Twitter users.

In this work, we studied the home timelines of a collection of Twitter users, in order to understand the type of content that users follow on Twitter. Particularly, we were interested in comparing the general Twitter population with a specific group of mainstream media consumers. Our goal was to investigate to what degree the timeline of each Twitter user, i.e.
the information to which the user is exposed, differs from the timelines of other users. In other words, we wanted to understand whether the experience of each user is unique or, on the contrary, similar to that of other users. To achieve this goal, we characterized the anatomy of each individual timeline, presenting aggregated results per sample and studying the diversity of consumed information, overall as well as over time.

2   Reference Work

Bache et al. [BNS13] proposed a text-based framework for quantifying the diversity of documents based on their terms. Their approach was based on the application of Latent Dirichlet Allocation [BNJ03], to build a topic model for a given corpus, and the computation of the distance matrix between pairs of topics, using measurements such as topic co-occurrence and topic-word similarity. They estimated the diversity for each document, in relation to the corpus, by combining the distance matrix with the topic distribution for the document.

White and Jose [WJ04] evaluated several measurements of topic similarity, grouping them into association (Dice, Jaccard, Cosine, Overlap), correlation (Spearman, Kendall, Pearson), and distance (Euclidean, L1 norm, Kullback-Leibler). For assessment, they used topics 101-150 from TREC and the San Jose Mercury News 1991 collection. They pre-selected 10 topics, ensuring a variable amount of overlap between the most relevant documents for each topic, and asked a group of 76 subjects to evaluate the similarity between each pair of topics using a 5-point scale (from highly dissimilar to highly similar). While the evaluation was done for only ten topics, according to their study, the most useful measurement group was the correlation group, followed by the association group and, only then, the distance group.

Zhao et al. [ZJW+11] compared Twitter and mainstream media using topic models. They used a sample of the Edinburgh Twitter Corpus [POL10], originally collected from the Stream API, and crawled news articles from the New York Times using its search function. Both datasets comprised documents for the timespan between November 11, 2009, and February 1, 2010. They used Latent Dirichlet Allocation to directly extract topics from the New York Times dataset, but, given the small size of tweets, they proposed a custom Twitter-LDA algorithm for topic detection in the Twitter dataset. In order to compare Twitter with mainstream media, they labeled detected topics using the categories provided by the New York Times, which they had to manually assign to their Twitter dataset. Moreover, they used three topic types to distinguish topics: event-oriented, entity-oriented, and long-standing. By looking at the distributions of topic categories and types, they discovered that Twitter provides more entity-oriented topics with low coverage on mainstream media, and that, although Twitter shows a low interest in world news, it helped spread news of important world events. The study we present here is similar in the sense that we also focus on understanding the position of mainstream media regarding Twitter, but it is also different in the sense that we keep our focus on Twitter, distinguishing between the home timelines of random users and the home timelines of mainstream media followers. Our study is centered around the individual (per user) consumption of content, for two distinct samples of users, as opposed to simply comparing the overall topics present in social media versus mainstream media. In particular, we are interested in studying the differences between the content that Twitter users are exposed to in their personal timelines.

There are multiple metrics that can serve as a diversity index [Jos06, Table 1], including True Diversity, Richness, Shannon Index, Simpson Index and Berger-Parker Index. Most diversity metrics are transformations of the effective number of types and have a particular interpretation dependent on the context of application. Our approach to studying diversity is based on the direct comparison of home timelines from individual users from two samples: one collected randomly and another one collected based on the preference to follow mainstream media accounts (i.e. users that share a common interest). We then analyze the cosine distances between all pairs of accounts within each sample, in order to quantify divergent behavior and thus estimate diversity.

3   Data Collection

In order to analyze the differences between the content that random users and mainstream media followers consume on their Twitter home timelines, we needed to indirectly obtain a sample of user home timelines. Given that Twitter does not provide this feature directly through its API, our approach consisted of the following five steps:

    1. Collect a sample of 20 users by generating random user IDs between 1 and the largest known user ID, from a recently created user.

    2. Collect a sample of 20 users that follow at least 3 UK news accounts from the following list: @BBCNews; @guardian; @Telegraph; @Independent; @MailOnline; @DailyMirror; @TheSun; @daily_express; @metrouk; @daily_star; @standardnews.
    3. For each collected user, fetch their followed ac-       users; on average, each user follows 621.42 users. Users
       counts.                                                 from “Sample UK News Followers 20” follow a total
                                                               of 22,082 distinct users; on average, each user follows
    4. At the same time, for each followed account, fetch      1,104.10 users.
       and store all their tweets for the past 14 days.           The tweets for each user’s followed account were col-
    5. Locally, for each collected user, retrieve its stored   lected for a period of 14 days, with slightly different
       followed account timelines, ordered by decreasing       start dates, resulting in an overall larger period of 55
       date, thus rebuilding the home timelines per user.      days, from Jul 19 2016 to Sep 12 2016. The timespan
                                                               for the home timelines of the 40 users in both samples
   Each collected user, described in steps 1 and 2, was        only overlapped for a period of 13 consecutive days,
subject to a set of criteria to ensure a minimum level         from Jul 20, 2016, to Aug 2, 2016. We analyzed the
of expected activity and connectivity of the accounts          average number of tweets over time, per day and per
(a basic check to discard inactive users):                     hour, respectively, for each sample. While “Sample
                                                               Random 20” is moderately stable per day, with a co-
    • The user must have created at least one tweet in         efficient of variation of 29.1%, “Sample UK News Fol-
      the last three months.                                   lowers 20” shows a more evident growth in the number
                                                               of tweets, peaking at Jul 29 and having a coefficient of
    • The user must have at least three followers.             variation of 42.0%. Regarding the average number of
                                                               tweets per hour, the maximum for “Sample Random 20”
    • The user must have created at least five tweets          occurred at 20:00 UTC on Jul 23, 2016, and the maximum
      since the creation of the account.                       for “Sample UK News Followers 20” occurred at 16:00
                                                               UTC on Aug 1, 2016 (worldwide), with coefficients
   The data was stored in an SQLite database. In
                                                               of variation of 33.3% and 42.9%, respectively.
order to define and describe each user sample, we
used a “user_samples” table where we stored groups
of user IDs, identified by a common sample ID. Each            5     Information Consumption
“user_samples” entry also contained a textual descrip-         When social media paved the way for pervasive com-
tion detailing the data collection approach, as well as        munication, people became both producers and con-
the user selection criteria (e.g., “Random users, gen-         sumers. This introduced a shift in habits with po-
erated by a random uniform sampling of Twitter user            tential implications for the quality and diversity of the
IDs between 1 and 3954358701, restricting language to          consumed information. In order to better understand
’en’, last tweet date to 2015-07-15 16:45:43, follower         the impact of this change, we set out to study how di-
count to 3 and status count to 5.”).                           verse timelines are, by focusing on what users con-
   In this paper, we characterize and compare the              sume, through their followed accounts. Our goal was
timelines for two user samples: “Sample Random 20”,            to answer the following questions: Do random users
which represents the baseline as a collection of ran-          and mainstream media followers have access to the
dom Twitter users, and “Sample UK News Followers               same information through different channels? Or do
20”, which represents a particular group of users who          the mainstream media still play a fundamental role in
have shown a general interest in mainstream media by           information diffusion that cannot be replaced by reg-
following well-known UK news accounts.                         ular Twitter users and “word-of-mouth”?
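
The storage and timeline-rebuilding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: only the “user_samples” table name comes from the text, while the other tables, all column names, and the functions `build_demo_db` and `rebuild_home_timeline` are hypothetical, and the in-memory database holds toy rows rather than collected data.

```python
import sqlite3

# Hypothetical schema: only the "user_samples" table name comes from the text;
# every other table and column is an assumption made for this illustration.
SCHEMA = """
CREATE TABLE user_samples (
    sample_id   INTEGER NOT NULL,   -- groups user IDs into one sample
    user_id     INTEGER NOT NULL,
    description TEXT,               -- collection approach and selection criteria
    PRIMARY KEY (sample_id, user_id)
);
CREATE TABLE follows (
    user_id     INTEGER NOT NULL,   -- collected user
    followed_id INTEGER NOT NULL,   -- account that the user follows
    PRIMARY KEY (user_id, followed_id)
);
CREATE TABLE tweets (
    tweet_id   INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL,    -- followed account that posted the tweet
    created_at TEXT NOT NULL,       -- ISO 8601 timestamp, sortable as text
    text       TEXT NOT NULL
);
"""


def rebuild_home_timeline(conn, user_id):
    """Merge the stored tweets of every followed account, newest first."""
    cursor = conn.execute(
        """
        SELECT t.tweet_id, t.author_id, t.created_at, t.text
        FROM follows AS f
        JOIN tweets AS t ON t.author_id = f.followed_id
        WHERE f.user_id = ?
        ORDER BY t.created_at DESC, t.tweet_id DESC
        """,
        (user_id,),
    )
    return cursor.fetchall()


def build_demo_db():
    """Create an in-memory database with a few toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO user_samples VALUES (?, ?, ?)",
        [(1, 100, "toy random sample"), (1, 101, "toy random sample")],
    )
    conn.executemany(
        "INSERT INTO follows VALUES (?, ?)",
        [(100, 7), (100, 8), (101, 8)],
    )
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?, ?)",
        [
            (1, 7, "2016-07-20T09:00:00", "first"),
            (2, 8, "2016-07-21T10:30:00", "second"),
            (3, 8, "2016-07-22T08:15:00", "third"),
            (4, 9, "2016-07-22T12:00:00", "posted by an account nobody follows"),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for row in rebuild_home_timeline(demo, 100):
        print(row)
```

In this layout a tweet is stored once and appears in the rebuilt timeline of every collected user who follows its author, so timelines of users with followed accounts in common share rows.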

4      Data Characterization                                   5.1   Measuring Diversity
Overall, our collection contains 5,287,221 distinct            We aimed at characterizing and understanding the dif-
tweets. However, as different accounts frequently have          ferences between the content consumed by random
followed accounts in common, the timelines overlap,            users and the content consumed by users with a par-
resulting in 7,758,779 analyzable tweets when looking          ticular interest in mainstream media. Our approach
at individual home timelines. “Sample Random 20”               consisted of creating a user profile based on the tweets
contains 947,068 distinct tweets, resulting in 1,080,789       received in a user’s timeline. Each tweet was prepro-
(13.93%) of the overall analyzable tweets. “Sample             cessed by removing emoji, links, mentions, ‘RT’ and
UK News Followers 20” contains 4,685,800 distinct              punctuation, and by normalizing spacing, through the
tweets, resulting in 6,677,990 (86.07%) of the overall         conversion of multiple spaces, tabs and new lines to a
analyzable tweets. Distinct tweets in “Sample Ran-             single space and the trimming of the text. We then
dom 20” and “Sample UK News Followers 20” inter-               created a document per user, containing a concatena-
sect, resulting in 345,647 common tweets. Users from           tion of all preprocessed tweets that appeared in the
“Sample Random 20” follow a total of 11,807 distinct           user’s home timeline. Each document was converted
to lower case and tokenized into unigrams, removing stopwords from several languages¹ and obtaining a document-term matrix, with the absolute term frequencies, per sample. Sparse terms were then pruned, ensuring a maximum sparsity of 0.996. This means that rare terms with more than 99.6% zeros, which were less useful in distinguishing user profiles, were simply discarded.

¹ We considered English, French, Spanish, Portuguese, Arabic, Russian, Greek and Hindi, but also typical expressions used in Twitter, like ‘via’ or ‘vs’.

The resulting document-term matrix for “Sample Random 20” contained 19 documents and 228,165 terms (meaning that one of the users received no tweets during the time span of the collection), and the document-term matrix for “Sample UK News Followers 20” contained 20 documents and 389,831 terms. In order to understand how diverse each timeline was, within either sample, we computed the cosine distance from each timeline to all others in the same sample. Timelines that are highly diverse will consistently have a high distance to most of the other timelines. Similarly, a sample will contain highly diverse timelines if the overall distances between all timelines are high, that is, timelines within a given sample considerably diverge in consumed content.

Figure 1 shows the box plot of the cosine distances between all pairs of timelines for each sample. As we can see, in particular through the median, “Sample Random 20” contains timelines that are more divergent among themselves (median cosine distance is 0.87), while “Sample UK News Followers 20” contains timelines that are much less divergent among themselves (median cosine distance is 0.33). We can say that mainstream media followers have less diverse information consumption habits when compared to a random sample of users.

Figure 1: Cosine distances per sample, for all pairs of timelines (box plot; y-axis: Cosine Distance, 0.00 to 1.00; x-axis: Sample Random 20, Sample UK News Followers 20).

5.1.1   Diversity over Time

We used a similar approach to study diversity over time, but instead of using a single user profile per timeline, we created a document per day for each user. This meant slicing the two original samples into 14 smaller parts, each part corresponding to one day, and repeating the study for each day.

Figure 2 depicts the dispersion of cosine distances between all pairs of timelines, per sample, over time. The daily behavior is consistent with the aggregated overall behavior, despite resulting in a slightly higher median cosine distance overall. This means that the information consumption habits of random users are more diverse than those of mainstream media followers, but also that information consumption diversity for random users is lower per day than overall for the 14 days and, on the other hand, for mainstream media followers, it is higher per day than overall. This is quite expected, as the number of topics discussed in a single day is intuitively smaller than the number discussed over the course of two weeks.

6   Conclusions

We have provided a consistent methodology to study the anatomy of a sample of Twitter timelines, focusing on content production and consumption, as well as on measuring overall and daily diversity. We studied the home timelines of two user samples: “Sample Random 20”, a random selection of users based on their numeric ID, and “Sample UK News Followers 20”, a selection of users that followed at least 3 out of 11 mainstream UK newspaper accounts.

We found that mainstream media followers consume a larger volume of information than random users. We analyzed the overall and the daily diversity over the course of two weeks, based on the cosine distances between all pairs of timelines, per sample. Both the overall and the daily diversity were consistently lower for the timelines of mainstream media followers, when compared to the timelines of random users. Interestingly, when analyzing the change from the overall two-week aggregations to the daily aggregations, the diversities of the two samples move slightly closer together, but still result in a lower diversity within mainstream media followers and a higher diversity within random users.

Overall, we can say that, when compared to random users, mainstream media followers consume a narrower range of content, covering a smaller number of topics, with a higher production volume. This can be explained by the fact that users in this sample share a common interest (i.e. UK news), as opposed to the users in the random sample that have no common characteristic. As expected, mainstream media followers consume a less diverse variety of content. This

Figure 2: Cosine distances per sample, for all pairs of timelines, per day (y-axis: Cosine Distance, 0.00 to 1.00; x-axis: Time, Jul 20 to Aug 02; series: Sample Random 20 and Sample UK News Followers 20). The lines correspond to a locally weighted scatterplot smoothing (or LOESS, from LOcal regrESSion); they depict overall diversity per sample.

diversity is higher when we look at individual days, probably representing the coverage of multiple topics throughout a day, but lower when we look at the two week period, probably representing the convergence of topics.

In the future, we would like to analyze a larger sample of timelines, and also explore the diversity within topic-based communities, such as those focused on a given hashtag or those that share a geographical context.

7   Acknowledgments

José Devezas is supported by research grant PD/BD/128160/2016, provided by the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT). This work is partially funded by FourEyes, a Research Line within project “TEC4Growth – Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020”, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[BNS13] Kevin Bache, David Newman, and Padhraic Smyth. Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13, page 23, 2013.

[CHBG10] Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, and Krishna P. Gummadi. Measuring user influence in twitter: The million follower fallacy. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM 2010), 2010.

[CSR14] Tiago Cunha, Carlos Soares, and Eduarda Mendes Rodrigues. Tweeprofiles: detection of spatio-temporal patterns on twitter. In International Conference on Advanced Data Mining and Applications, pages 123–136. Springer International Publishing, 2014.

[Jos06] Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006.

[KWM11] Efthymios Kouloumpis, Theresa Wilson, and Johanna D. Moore. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), pages 538–541, Barcelona, Catalonia, Spain, July 2011. AAAI Press.

[LKSM14] Yabing Liu, Chloe Kliman-Silver, and Alan Mislove. The tweets they are a-changin’: Evolution of Twitter users and behavior. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM 2014), Ann Arbor, MI, June 2014.

[MJA+11] Alan Mislove, Sune Lehmann Jørgensen, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist. Understanding the demographics of twitter users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), pages 554–557. AAAI Press, 2011.

[POL10] Saša Petrović, Miles Osborne, and Victor Lavrenko. The Edinburgh Twitter Corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media, pages 25–26, 2010.

[WJ04] Ryen W White and Joemon M Jose. A study of topic similarity measures. In Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR ’04, page 520, 2004.

[ZJW+11] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing Twitter and Traditional Media using Topic Models. In Advances in Information Retrieval, pages 338–349. Springer Berlin Heidelberg, 2011.
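
As a supplementary illustration of the diversity measurement described in Section 5.1, the following sketch builds one term-frequency document per user, prunes sparse terms, and computes the cosine distance between all pairs of documents in a sample. It is a minimal sketch under stated assumptions rather than the authors' implementation: the paper does not name its tooling, so the function names, the toy stopword list, the regular expressions used for cleaning, and the convention that an empty document has distance 1.0 to every other document are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations
from statistics import median

# Toy stopword list (assumption); the paper used lists for several languages.
STOPWORDS = {"the", "a", "of", "and", "via", "vs"}


def preprocess(tweet):
    """Strip links, mentions, 'RT' and punctuation, then normalize spacing."""
    text = re.sub(r"https?://\S+", " ", tweet)   # links
    text = re.sub(r"@\w+", " ", text)            # mentions
    text = re.sub(r"\bRT\b", " ", text)          # retweet marker
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()     # collapse spaces, tabs, newlines


def user_document(tweets):
    """Concatenate preprocessed tweets, lower-case, tokenize, drop stopwords."""
    tokens = " ".join(preprocess(t) for t in tweets).lower().split()
    return Counter(tok for tok in tokens if tok not in STOPWORDS)


def prune_sparse_terms(docs, max_sparsity):
    """Drop terms that are absent from more than max_sparsity of the documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    keep = {t for t, df in doc_freq.items() if 1 - df / n_docs <= max_sparsity}
    return [Counter({t: c for t, c in doc.items() if t in keep}) for doc in docs]


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty document is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(docs):
    """Cosine distances between all unordered pairs of documents."""
    return [cosine_distance(a, b) for a, b in combinations(docs, 2)]


if __name__ == "__main__":
    timelines = [
        ["RT @bbc: Election results announced http://t.co/x", "Election debate!"],
        ["Election results and debate"],
        ["Cute cats everywhere"],
    ]
    docs = prune_sparse_terms([user_document(t) for t in timelines], 0.996)
    print("median cosine distance:", median(pairwise_distances(docs)))
```

The per-day analysis of Section 5.1.1 would apply the same functions to one document per user per day instead of one document per user.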