Birds of a Feather Tweet Together: Computational Techniques to Understand User Communities in Social Networks

David Burth Kurka, (1) University of Campinas, (2) Imperial College London, d.kurka@ic.ac.uk
Alan Godoy, (1) University of Campinas, (2) CPqD Foundation, godoy@dca.fee.unicamp.br
Fernando J. Von Zuben, University of Campinas, vonzuben@dca.fee.unicamp.br

ABSTRACT
The study of social systems shows that there is a relationship of mutual influence between social connections and individual behavior, known as homophily. In this work, we developed a methodology to allow the analysis of the interests of groups of users in the Twitter network, based on automatic community detection and tweet ranking. The techniques presented reveal evidence that the presence of communities is related to topic specialization, and allow the characterization of elaborate profiles of groups of users based only on their location on the network.

CCS Concepts
•Networks → Online social networks; •Human-centered computing → Social network analysis; Collaborative and social computing;

Keywords
Online social networks, social network analysis, community detection, homophily, complex systems.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2016 Workshop proceedings, available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235
· #Microposts2016 · 6th Workshop on Making Sense of Microposts · @WWW2016

1. INTRODUCTION
The experience of the last decade has shown that online social networks (OSNs) are not only useful for the amusement of Internet users, but can also be a valuable source of data for the study of social systems. The millions to billions of users who access OSN services every day give researchers an unprecedented possibility to gather information on human activity and social behavior, enabling the investigation of complex issues [7].

One of the interesting topics that can be explored in this context is the creation and diffusion of content. OSN users are constantly involved in complex dynamics of message sharing, which may result in the emergence of new trends, collective mobilization and opinion formation. Understanding the essential mechanisms of this process can be useful to areas such as politics [6] and marketing campaigns [10, 20].

This paper explores the interplay between social connections and content shared by individuals in an online social network, which is one aspect of social communication. We investigate how it is possible to understand the behavior and interests of a group of individuals based on their social connections, paying special attention to the role of homophily, the tendency for someone to establish connections to similar peers. Analytical tools were developed, and applied to real data extracted from OSNs, to pinpoint content relevant to the understanding of the interests of communities and of the users composing them. The techniques used are based solely on the knowledge of which users shared each message, not resorting to the message contents. We believe that, in addition to indicating the relevance of sharing information for the classification of messages, such tools can be useful to the practical study of social systems and to the development of new applications, such as recommendation systems.

The paper is structured as follows. In Section 2 we introduce relevant literature on OSNs and homophily. In Section 3, we present the methods used to acquire and analyze data. Then, in Section 4, we report and analyze the results obtained using our methodology. Finally, in Section 5, we present final remarks, discussing implications of this work and possible future directions.

2. BACKGROUND
An increasing number of studies focusing on online social networks have been conducted over the last years [7]. Dynamics of information diffusion [3], public opinion prediction [19], collective sentiment analysis [9, 18] and the formation of social structures [8] are among the subjects explored.

An important aspect investigated is the presence of self-organized processes. Despite not having central controllers that rule how content is disseminated or how connections between users are created, OSNs display many organized behaviors. Common examples of such processes are the collective curation of contents [17] and information diffusion cascades [5].

2.1 Communities and Homophily
Self-organizing processes also take part in how connections are formed in an OSN. Usually, social connections are not created uniformly between all individuals, but are concentrated in a few hubs [1]. Thus, when analysing topologies of real networks, one finds communities of individuals who are more likely to be linked to each other, with fewer connections between individuals from different communities. Social systems exhibit communities at many different levels: people inside families, sharing specific common interests, or living in the same city or nation have many more connections to in-group members than to out-group members [12].

Considering the flow of information that travels over the network, groups of highly connected individuals may imply a redundancy of communication channels, possibly enhancing information that passes through or is generated by these groups. In the case of online social networks, this can imply that content may be more easily proliferated and reinforced inside a community after it is shared by a member of that group. While communities are expected to influence how their members receive and process information, it is also believed that community formation is influenced by pre-existing affinities between members [21].

Researchers identified in social networks outside the virtual world a tendency (not only within communities) of individuals with common interests to be connected to each other [11]. This phenomenon is called homophily and is also observed in OSNs.

Kwak et al. [8], in an early study analyzing Twitter's data, showed evidence of homophily among users with the same location and the same number of friends (popularity). Also on Twitter, Wu et al. [22] found a strong tendency for users belonging to the same category (e.g. artists, organizations, bloggers) to communicate among themselves. Romero et al. [14] studied the relationship between the (explicit) network of friendship and the (implicit) network of topical affiliations, i.e., the communities formed by users interested in a common topic. They showed that both networks have considerable intersection and that users tend to connect to other users with common interests. This correlation allows the prediction of friendship connections from hashtag diffusions and also the forecast of the future popularity of a hashtag from the friends network.

Bollen et al. [2] verified that users' emotions are strongly correlated with social connections, showing that users considered happy tend to be linked to each other. Salathé et al. [16] explored how a network with signs of homophily interferes with the spread of sentiment towards a new vaccine, showing how negative opinions can be reinforced in such environments.

2.2 Communities Detection
Some of the most common approaches to community detection are modularity-based algorithms [13], which look to partition a network into communities so that the summed weight of all connections between two communities is minimized. This approach, like most other techniques usually applied for community detection, is insensitive to the direction of connections in the network. In systems where patterns of flow among individuals are relevant, however, ignoring connection direction may disregard information valuable to the comprehension of collective behavior.

In order to address this issue, Rosvall and Bergstrom [15] proposed a flow-based method, which defines a community as a set of individuals "among which information flows quickly and easily". Their algorithm takes advantage of both the direction and the weights of connections, using information theoretical measures and a random walk as a proxy of information flow. Using a greedy search, they look for a partition of individuals that defines a two-level description (the first level indicating the partition and the second indicating an individual inside the partition) that minimizes the expected description length of a random walk in the graph. These partitions are, thus, the communities of individuals in the network.

3. METHODS

3.1 Data Acquisition
As Twitter has plenty of public data available online, it is a good source for creating a database of social events. However, the rate limits imposed by its API (footnote 1) hamper the download of large volumes of data from events that took place in the past. An alternative is to use Twitter's Streaming APIs, which allow the download of messages in real time, as they are posted.

Therefore, in order to collect a satisfactory amount of data to be analyzed, we decided to track the interactions between a popular user account that produces original content and the users that share (i.e. retweet) these messages. As popular Twitter accounts interact with many users daily, this approach proved to be an effective way of collecting message diffusion processes and user interactions as they happen.

We chose the Twitter account of the largest Brazilian newspaper, Folha de São Paulo (footnote 2), and collected: original messages posted, retweets of those messages posted by other users, account details of those users and the relationships (followers and followees) of all of them.

Footnote 1: https://dev.twitter.com/rest/public/rate-limiting
Footnote 2: https://twitter.com/folha

3.2 Automatic Topic Classification
As with many other newspapers, almost all messages published by the chosen source are headlines, followed by a link to the newspaper's website with the full content of the news. As the news articles on the website belong to thematic categories (the newspaper's sections), it was possible to automatically attribute a class to each tweet, based on these categories. This procedure was carried out for all tweets and the six most common topics were identified, namely: "daily life news", "sports", "world", "politics", "entertainment" and "market".

3.3 Detecting common interests in communities
After collecting and classifying all data, we evaluated whether the retweeting behaviors of community members are coherent among themselves regarding the subjects of the shared messages. In order to detect groups of tightly connected users from the social connections observed, we executed Rosvall and Bergstrom's community detection algorithm [15] on our social network, an algorithm focused on locating groups of users among which information can flow more quickly.

Understanding the interests of a community is a hard task. To address this issue, we propose here an adaptation of a well-known statistic from text mining, the term frequency-inverse document frequency (tf-idf) [4].

The tf-idf method usually considers a corpus of documents, each composed of different terms. By comparing the frequency of specific terms inside and outside documents, values are assigned to each term, weighting its importance for each document. The tf-idf of a term t in a specific document d is calculated as follows:

$$\mathrm{tfidf}_d(t) = \mathrm{tf}_d(t) \times \mathrm{idf}(t),$$

where $\mathrm{tf}_d(t)$ is a value that grows with the frequency of the term t in d, while $\mathrm{idf}(t)$ is inversely proportional to the frequency of t in all documents of the corpus.

In the traditional use of the tf-idf algorithm in text mining, a term is a word and a set of terms is a textual document (e.g. a book, or a webpage). This creates a document-term matrix, where each row represents a document, each column a word and each cell, indexed by (i, j), the frequency of the term j in document i. For our application, we adapted this approach by considering each tweet as a term and each community as a document. Therefore, we built a community-tweet matrix where each cell (i, j) represents the number of retweets of tweet j in community i.

Thus, when the same tf-idf operations are applied to a community-tweet matrix, tweets that are both highly shared by members of a community and more specific to that community's behavior are highlighted with higher scores than the others.

4. RESULTS

4.1 Database Description
From March 19, 2014 to September 21, 2014, all messages (tweets) posted by the source account, as well as any share of this content by other users (retweets), were collected.

From the data collected, it was possible to track a large number of information diffusion processes triggered by the observed source. During the observed period, 13463 distinct and original messages posted by the source account were collected. From this set, a series of filters were sequentially applied, forming a more appropriate dataset for the work, as described below:

• Filter 1: Only messages which had received at least 20 retweets were selected, resulting in a group of 4671 distinct messages;

• Filter 2: Messages not belonging to one of the six main categories ("daily life news", "sports", "world", "politics", "entertainment" and "market") were removed from the above set, resulting in a group of 3185 messages;

• Filter 3: Users who retweeted more than 60% of the publications were considered automated scripts (bots) and therefore removed from the database and from the retweet counts (just one user was identified as such);

• Filter 4: As in 2014 Brazil held the Football World Cup and presidential elections, there was a high number of messages in the categories "politics" and "sports". In order to balance the proportion of messages in each topic, a maximum limit of 450 messages was set for each category.

The filtered data resulted in a collection of 2444 messages, retweeted at some moment by 44320 distinct users. From the users' lists of followers and followees, it was possible to characterize a network, registering the social connections between them. The source user, Folha de São Paulo, was removed from the network, so that the interactions among its followers became the focus of the analysis. Table 1 presents a more complete characterization of the collected data and of the network formed between users. It is also worth pointing out that most users do not participate actively in the diffusions, implying a high diversity of users participating in the processes, but low recurrence: during the observation period, each user retweeted on average only two messages from the source.

Table 1: Characterization of the collected data.
Number of users | 44320
Connections between users | 673982
Average degree | 30.42 ($\langle k_{in} \rangle = \langle k_{out} \rangle = 15.21$)
Most followed user | @rodrigovesgo (comedian), 5742 followers
Most popular message | 743 retweets
Clustering coefficient | 3.66% (C for a randomized network is 0.07%)
Diameter | 15
Average path length | 4.29
Strongly connected components | 12358
Size of giant component | 31493 (71% of all users)
Total number of retweets | 110389 (2.49 per user / 45.16 per message)

4.2 Collective Coherence in Communities
After running the community detection algorithm, 4278 communities were detected, many with 2 elements (2 connected users, isolated from the rest of the network), but 26 larger groups with 200 users or more were also identified.

A first experiment involving communities consisted of checking how coherent the behaviors of communities were, by comparing the frequency at which specific messages were shared inside communities and in the complete network. For each community with a relevant number of users (200 or more), the number of retweets of each original tweet from Folha de São Paulo was computed. A high coherence was verified in the behaviors of individuals belonging to the same community. Table 2 compares the sharing rates inside and outside the four largest communities, showing the five most distinctive cases where the community has a singular sharing behavior, differing from the global behavior.

Table 2: Comparison of the frequencies of retweets between inside and outside the four largest communities.
Community 4, members: 4870, total tweets: 8978
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 0.7% | 1.0% | 0.8% | 0.1% | 0.3%
Outside | 1.8% | 1.8% | 1.5% | 0.6% | 0.8%
Community 1, members: 2621, total tweets: 12292
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 2.5% | 1.9% | 2.1% | 1.7% | 1.4%
Outside | 0.6% | 0.3% | 0.6% | 0.2% | 0.0%
Community 2, members: 1022, total tweets: 3160
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 4.9% | 3.7% | 3.7% | 3.6% | 3.4%
Outside | 0.2% | 0.1% | 0.2% | 0.4% | 0.4%
Community 8, members: 742, total tweets: 2117
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 0.1% | 0.0% | 0.9% | 0.8% | 1.1%
Outside | 1.7% | 1.4% | 0.1% | 0.1% | 0.3%

The cases shown in Table 2 indicate that there is a certain level of coordination in the selection of shared messages by each community. It is interesting to see how some messages are much more emphasized inside a community than in the general network. Community 2, for example, presents higher sharing rates for its messages, compared to the rest of the network. Equally interesting are cases where the community seems to suppress the spread of a message, as seen in community 8, where highly retweeted messages are less emphasized by the community's members.

For comparison purposes, an attempt was made to break the relationship between user connections and postings, by randomly swapping the retweet patterns of different users in the database, while keeping their social connections. Thus, under this new experiment, the communities remain defined as they were originally, but their retweets follow patterns from users of different communities, in order to eliminate any homophily related to retweeting behavior while keeping intact other characteristics of our data (such as global retweet counts and correlations between sharings by individual users). The same analysis made before was then performed on the randomized dataset, resulting in Table 3. It becomes evident that when homophily is suppressed, communities lose their particular behavior and present sharing rates closer to the rates of the whole network, indicating that the differences in retweeting patterns are not only an artifact of the finite sizes of communities.

Table 3: Comparison of the frequencies of retweets between inside and outside communities for the randomized dataset.
Community 4, members: 4870, total tweets: 12214
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 1.3% | 0.8% | 0.9% | 0.2% | 0.6%
Outside | 1.7% | 0.5% | 0.7% | 0.5% | 0.4%
Community 1, members: 2621, total tweets: 6967
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 1.3% | 0.5% | 0.6% | 0.4% | 0.5%
Outside | 1.0% | 0.2% | 0.3% | 0.1% | 0.2%
Community 2, members: 1022, total tweets: 2505
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 1.8% | 1.4% | 1.1% | 0.9% | 2.0%
Outside | 1.0% | 0.6% | 1.7% | 0.3% | 1.4%
Community 8, members: 742, total tweets: 1619
 | 1st | 2nd | 3rd | 4th | 5th
Inside | 0.9% | 1.1% | 1.1% | 0.7% | 1.2%
Outside | 0.2% | 1.7% | 0.5% | 0.1% | 0.7%

To verify how strong the preference of members of a community for specific messages was, we compared the Gini index (footnote 3) of the number of retweets within each community in the real data with the same index for the randomized dataset. Our results indicate that the real communities have stronger preferences for specific messages, with satisfactory statistical significance when considering all communities (Wilcoxon test, $p = 1.23 \times 10^{-15}$, Gini index difference of $1.5 \times 10^{-4}$) and an even more pronounced difference when restricting the comparison to larger communities, with at least 50 individuals (Wilcoxon test, $p = 2.13 \times 10^{-4}$, Gini index difference of $8.26 \times 10^{-3}$).

Footnote 3: The Gini index is a measure of how unevenly a value is distributed among the elements of a group. In our case, if the Gini index is close to 1, then most of the retweets seen in a community were associated with few messages; if its value is near 0, then the distribution of retweets is closer to uniform.

4.3 Topic Specialization
A second experiment consisted of analysing how topics are distributed among communities. For this, the six standard categories were considered and the number of messages of each category shared by each community was computed. First, Figure 1a shows the general distribution of messages per topic for the whole network. Then, Figures 1b-1h show the distribution for seven distinctive communities chosen among those with more than 200 members. The figures demonstrate how the topic distribution of a community may have a different profile compared to the rest of the network.

From the graphs presented, communities 1 and 16 (Figures 1b and 1f) seem to have a stronger interest in political issues, with community 1 showing an interest in daily life news slightly above the average. Community 12 (Figure 1e) also shows a higher interest in politics, but divides it with a focus on sports news. The graph of community 21 (Figure 1g) shows a preference for daily life news, but not in a very distinctive way. Community 10 (Figure 1d) does not have a distribution very different from the whole network (Figure 1a), being representative of the average preferences. Community 25 (Figure 1h), in turn, is remarkably different, with over 50% of its retweets being about entertainment and practically all the rest regarding sports news, almost ignoring the other topics.

Figure 1: Topic distribution in the entire network and within the largest communities. In the histograms, topics are represented as follows: (1) "daily life news"; (2) "sports"; (3) "world"; (4) "politics"; (5) "entertainment"; (6) "market". Panels: (a) Whole network; (b) Community 1, members: 1022, retweets: 3160; (c) Community 9, members: 272, retweets: 636; (d) Community 10, members: 283, retweets: 861; (e) Community 12, members: 316, retweets: 820; (f) Community 16, members: 215, retweets: 527; (g) Community 21, members: 268, retweets: 745; (h) Community 25, members: 217, retweets: 254.

4.4 Detecting Relevant Messages
By applying the tf-idf normalization to the retweet counts of the communities, we identified the most characteristic tweets for each community. Table 4 presents the top five messages for the largest communities in the network. The message contents were translated from Portuguese to English, with some translation notes (in brackets) where necessary.

It was possible to deepen the analysis of each community profile beyond what would be possible by simply looking at the distribution of general topics in each community. Analysing each group of messages individually, it is possible to notice specific and subjective categories. For example, although communities 1 and 16 both have an emphasis on politics, community 1 seems to be supportive of the Brazilian government (focusing on good results of policies made by the government and on scandals of the opposition), while community 16 appears more involved in topics related to the opposition candidates in the election held at the time.

A very interesting conclusion comes from the analysis of the relevant messages from community 12, as we discover that the tweets are not connected by the newspaper sections, but by subjects regarding the Brazilian state of Pernambuco and its capital, Recife. All the messages presented were related to events taking place in the state, involving both political matters (police strike) and sports (2014 FIFA World Cup events). This analysis shows both the limitations of the standard categorization of topics (the six classes defined by the newspaper) and the potential of the tf-idf technique for revealing new subjective connections among messages.

Another strong topic specialization is noticed in community 9, where all the top five messages are related to the football team Palmeiras, giving evidence that the community consists mainly of the team's supporters. Community 25 is specialized in entertainment topics, showing an apparent tendency to emphasize messages related to international pop culture. Interestingly, although the topic distribution in this community was predominantly on entertainment and sports, the tf-idf normalization reveals the relative relevance of messages in other categories, such as world and market (market is the least shared topic among the six categories). Sports tweets were not present among the top five messages, which probably means that the sports messages shared by the community followed the general distribution, not revealing a distinctive behavior of the community.

This kind of qualitative analysis of community behavior could be made for most communities detected in the network, but is not presented here due to space constraints. Other examples of topic specialization present in communities include: regional news (from diverse Brazilian states), international politics, economy, football discussions (in general and regarding specific teams) and corruption.

Table 4: tf-idf results, showing most representative tweets in selected communities.
Community 1, members: 1022, retweets: 3160
tf-idf | Category | Tweet's content
4.09 | politics | Most Minas [Brazilian state] voters are unaware of airport made by Aécio [opposition candidate for presidency]. http://t.co/9pmkp1yS1P
4.02 | market | Oil production in the country grows almost 15% and hits record, says ANP. http://t.co/Z9kgIRmoIh
3.70 | daily life | Book about Lula [former president] will be the last of my career, says writer Fernando Morais. http://t.co/jw84KtniZw
3.67 | daily life | Brazil has reduced by 50% the number of people suffering hunger, the UN says. http://t.co/mOVqNYUGUa
3.63 | market | Government expands My House [housing program] in 350 thousand units in the first half of 2015. http://t.co/zAoTS3vjyU
Community 9, members: 272, retweets: 636
tf-idf | Category | Tweet's content
3.79 | sports | Brazilian national football team will play in the new stadium of Palmeiras [football team] http://t.co/V8DoEI42bm
3.69 | sports | Palmeiras was born champion with Oberdan Cattani in goal. http://t.co/5th4sBM553
3.68 | sports | Maurício de Sousa [Brazilian cartoonist] makes drawing in honor of Palmeiras centenary. http://t.co/3CBPgidV9O
3.51 | sports | Cristaldo scores, Palmeiras beats Criciúma [football team] and wins 1st with Dorival. http://t.co/MJZNLxUEA6
3.35 | sports | Fans flock to the streets to wait centenary of Palmeiras. http://t.co/uwdrcb5wCF
Community 10, members: 283, retweets: 861
tf-idf | Category | Tweet's content
4.06 | market | With wicked face, Harley-Davidson Fat Boy Special is sweet to drive. http://t.co/FeyVtebUX7
3.99 | world | Pope Francis says corrupts will be held accountable to God. http://t.co/dd77QEnUei
3.89 | market | Federal prosecutor's office of São Paulo denounces Eike Batista [business man]. http://t.co/SVQCBYPVPd
3.81 | entertainment | 85 years-old, the cult filmmaker Alejandro Jodorowsky conquers Twitter with philosophy and mysticism pills. http://t.co/wv9A1ggx…
3.75 | daily life | Alckmin [São Paulo state governor] will sanction this week bill prohibiting masks in protests. http://t.co/EkMmwM7jHk
Community 12, members: 316, retweets: 820
tf-idf | Category | Tweet's content
5.03 | politics | With police strike, army and national security force will secure Pernambuco [Brazilian state]. http://t.co/7ynDAQwS8u
4.78 | sports | Court declares Sport [Pernambuco's football team] as the sole champion of 87; Flamengo [football team] can go to the Supreme Court. http://t.co/BeolyRwnpd
4.51 | politics | PE [Pernambuco state] says that it will only negotiate with PM [police] if strike is over. http://t.co/J3qL9ZBjic
3.91 | sports | Suspect of throwing toilet bowl that killed fan is arrested in Recife [Pernambuco capital] http://t.co/m93KTdGagq
3.83 | sports | #FolhaintheWorldCup skewer will cost R$ 15 in the World Cup stadiums. View the price of other items: http://t.co/k4F8QPJ4RP
Community 16, members: 215, retweets: 527
tf-idf | Category | Tweet's content
4.15 | politics | Even with rain, population attends Campos' [PSB's presidential candidate dead in an airplane crash] funeral. http://t.co/1mkl0IpO4W
4.08 | politics | In Maranhão [Brazilian state], Campos says that will send Sarney [politician] to the opposition. http://t.co/0qMC9uxfvw
3.85 | politics | Sarney tells Dilma [Brazilian president] that he won't be a candidate anymore. http://t.co/Dks1zZmE02
3.77 | politics | Frederico Vasconcelos: Leandro Paulsen: "It would be good to have a tax in the Supreme Court. http://t.co/5IPV53igmh
3.75 | politics | In video recorded inside Papuda prison, José Dirceu [arrested politician] complains about the closed regime. Watch: http://t.co/7gmRr9CwLU
Community 21, members: 268, retweets: 745
tf-idf | Category | Tweet's content
4.32 | politics | SP [São Paulo state] subway workers reject agreement and decide to go on strike on Thursday. http://t.co/sxcervHS2x
3.81 | daily life | Drivers and conductors block garages in Osasco and Diadema [cities near São Paulo]. http://t.co/tULggQlUDq
3.74 | politics | #FolhaintheWorldCup Chamber of deputies reject holiday during Brazil's matches. http://t.co/Cuw9ue6qNa
3.69 | daily life | Subway does a campaign against sexual harassment in train and subway stations. http://t.co/uVJo0mX3dD
3.58 | world | Chinese ship detects signal that can be the black box of missing plane, says the agency. http://t.co/X9HP1bmUgG
Community 25, members: 217, retweets: 254
tf-idf | Category | Tweet's content
3.81 | entertainment | Aged 22, a member of One Direction buys English football team. http://t.co/HoxsKSpXY9
3.06 | world | Data protection in the digital age requires "increased attention", says Dilma. http://t.co/CXh5pQuCRf
3.01 | entertainment | American college offers lectures on Miley Cyrus. http://t.co/KjtiKrKcnc (via @sitef5)
2.87 | market | Banks plan to 'extinguish' DOC [money transfer format] until 2015. http://t.co/9CnHHUNES1
2.87 | daily life | Demonstration blocks roads of the east side of São Paulo city. http://t.co/ix7I5kIukU

5. DISCUSSION
This research presents a computational framework for general investigations on collective behavior. When applied to a large dataset, the method presents new evidence of homophily in Twitter's network. Although homophily in Twitter had already been found in different studies [2, 8, 22], homophily may be based on many different criteria, such as ethnic background, social class, mood, etc. The results we presented highlight the relation between shared interests and Twitter's structure.

The use of tf-idf jointly with community detection was able to group and order messages according to their relevance to a social community, enabling the characterization of complex behavior profiles inside communities. The presented method was able to reveal more nuanced classes of contents, such as political positions, regional matters and fan clubs, that were not covered in the original six categories, defined by the newspaper's staff with the specific purposes of organization and classification. It is relevant to notice that, beyond Twitter, the same technique can be applied to the analysis of different sets and databases from social networks, enabling similar studies for different services and contexts.

The empirical results of this study show new concrete evidence of how individual behavior and social connections are closely related, as expected in theory. In fact, so much information is present in social connections that it was even possible to find groups of similar messages without any knowledge of their content, using techniques that only consider network properties and sharing behaviour. This is reflected in the list of the most representative messages for each community (Table 4), which are usually related to few subjects. Accordingly, by combining information about which communities interacted with some object (a tweet, for instance) with other techniques (e.g., natural language processing), it may be possible to gather new knowledge about such an object that is not evident in it (such as social or geographic contexts). In an extended perspective of this result, one can wonder if it might be possible to infer significant elements about the nature of a process happening on a social network even without access to the content traveling through such network, if there is information available about its structure and dynamics.

Ethical implications of the power of such methods for detecting users' specific interests, without direct access to their personal information, should be considered. If, on one hand, this knowledge can be used to improve the performance of useful systems, such as machine learning algorithms, on the other hand it may also pose risks to privacy and security. This discussion is not conducted in detail here, but should not be forgotten.

The communities observed had elements of cohesion in their general behaviors, emphasizing or even repressing the spread of certain types of content. An interesting conflict between individual autonomy and collective behavior seems to be part of the information diffusion processes that take place on OSNs self-organized in communities. The community specialization in topics of interest is also evidenced. In a context of proliferation of many different subjects, limiting the scope of themes discussed within a community can be an efficient strategy for individuals to deal with information overload.

Further steps of the research include a deeper analysis of a database including more messages from multiple sources, using text mining to define the subjects of messages and comparing such classification to the results obtained by tf-idf. In future studies, the homophily of sharing behavior in online social networks can be the subject of a deeper analysis, developing new methods to try to determine how much of it is due to (1) the preference of individuals to establish new social connections to similar peers; (2) social influence; (3) indirect homophily, which occurs due to the existence of homophily of another trait (e.g., if two individuals usually access Twitter in the same hours of the day, regardless of their connections they will be more likely to read the same news and, thus, retweet it). An even deeper analysis can be made about social influence, in order to divide it into its reactive part, in which an individual exhibits a sharing behavior in favor of some subject because his/her community publishes more about such subject, not because of an inner preference, and its cognitive part, in which, by observing his/her peers' behaviors, an individual

[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW '10), page 591, New York, New York, USA, 2010. ACM Press.
[9] T. Lansdall-Welfare, V. Lampos, and N. Cristianini. Effects of the recession on public mood in the UK. In Proceedings of the 21st International Conference Companion on World Wide Web (WWW '12), pages 1221–1226. ACM, 2012.
[10] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1):5, May 2007.
[11] M. McPherson, L. Smith-Lovin, and J. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
[12] M. Mitchell. Complexity: A Guided Tour. Oxford University Press, 2009.
[13] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
[14] D. Romero, C. Tan, and J. Ugander. On the interplay between social and topical structure. arXiv:1112.1115 [cs.SI], 2011.
[15] M. Rosvall and C. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[16] M. Salathé, D. Q. Vu, S. Khandelwal, and D. R. Hunter. The dynamics of health behavior sentiments on a large online social network. EPJ Data Science, 2(1):4, Apr. 2013.
[17] A. Sarcevic, L. Palen, J. White, K.
Starbird, shapes his/her preferences according to those practiced by M. Bagdouri, and K. Anderson. “beacons of hope” in his/her peers. decentralized coordination. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative 6. REFERENCES Work (CSCW ’12), New York, New York, USA, 2012. [1] A. Barabási and R. Albert. Emergence of scaling in ACM, ACM Press. random networks. Science, 286(5439):509–512, Oct. [18] S. Stieglitz and L. Dang-Xuan. Political 1999. communication and influence through microblogging – [2] J. Bollen, B. Goncalves, G. Ruan, and H. Mao. an empirical analysis of sentiment in Twitter messages Happiness is assortative in online social networks. and retweet behavior. In 2012 45th Hawaii Artificial life, 17(3):237–251, Jan. 2011. International Conference on System Science (HICSS), [3] J. Borge-Holthoefer, R. Baños, S. González-Bailón, pages 3500–3509, 2012. and Y. Moreno. Cascading behaviour in complex [19] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe. socio-technical networks. Journal of Complex Predicting elections with Twitter: What 140 Networks, 1(1):3–24, 2013. characters reveal about political sentiment. In [4] M. Dillon. Introduction to modern information International AAAI Conference on Weblogs and Social retrieval. Information Processing & Management, Media, 2010. 19(6):402–403, 1983. [20] D. Watts and P. Dodds. Influentials, networks, and [5] S. Goel, D. Watts, and D. Goldstein. The structure of public opinion formation. Journal of Consumer online diffusion networks. In Proceedings of the 13th Research, 34(4):441–458, 2007. ACM Conference on Electronic Commerce (EC ’12), [21] F. Wu, B. a. Huberman, L. a. Adamic, and J. R. volume 1, page 623, New York, New York, USA, 2012. Tyler. Information flow in social groups. Physica A: ACM Press. Statistical Mechanics and its Applications, [6] A. Halu, K. Zhao, A. Baronchelli, and G. Bianconi. 337(1-2):327–335, June 2004. Connect and win: The role of social networks in [22] S. Wu, J. 
M. Hofman, W. a. Mason, and D. J. Watts. political elections. EPL (Europhysics Letters), Who says what to whom on twitter. In Proceedings of 102(1):16002, Apr. 2013. the 20th International Conference on World Wide [7] D. Kurka, A. Godoy, and F. Von Zuben. Online social Web (WWW ’11), page 705, New York, New York, network analysis: A survey of research applications in USA, 2011. ACM Press. computer science. arXiv:0707.3168 [cs.SI], 2015. 27 · #Microposts2016 · 6th Workshop on Making Sense of Microposts · @WWW2016
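
As an illustration of the content-agnostic ranking discussed above, the following is a minimal sketch of a tf-idf ranking of shared messages per community. It assumes that each community plays the role of a document and each message identifier the role of a term, so that messages shared by every community receive a weight of zero. This section does not restate the exact weighting scheme used in the study, so the formula, the function name rank_messages_by_tfidf, and the input format (one (community, message) pair per share event) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def rank_messages_by_tfidf(shares, top_k=5):
    """Rank messages per community using only who shared what.

    shares: iterable of (community_id, message_id) pairs, one per share event
            (hypothetical input format, not taken from the paper).
    Returns {community_id: [(message_id, score), ...]} with the top_k
    highest-scoring messages of each community, best first.
    """
    # Term frequency source: how often each community shared each message.
    counts = {}
    for community, message in shares:
        counts.setdefault(community, Counter())[message] += 1

    n_communities = len(counts)

    # Document frequency: in how many communities each message was shared.
    doc_freq = Counter()
    for per_community in counts.values():
        doc_freq.update(per_community.keys())

    ranking = {}
    for community, per_community in counts.items():
        total_shares = sum(per_community.values())
        scored = []
        for message, n_shares in per_community.items():
            tf = n_shares / total_shares
            # Assumed idf variant: log(N / df); zero for ubiquitous messages.
            idf = math.log(n_communities / doc_freq[message])
            scored.append((message, tf * idf))
        # Sort by descending score, breaking ties by message id.
        scored.sort(key=lambda item: (-item[1], item[0]))
        ranking[community] = scored[:top_k]
    return ranking
```

Under these assumptions, a message shared at the same rate by all communities (for example, a general headline) drops to the bottom of every ranking, while a message shared almost exclusively inside one community rises to the top of that community's list. No message text is inspected at any point.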