Duplicate Removal for Overlapping Clusters: A Study Using Social Media Data Amit Paul and Animesh Dutta Department of Computer Science and Engineering, NIT Durgapur, India, amitpaul06@gmail.com Department of Computer Science and Engineering, NIT Durgapur, India, animeshnit@gmail.com Abstract et al. 2011), identification of influencers (Cha et al. 2010; Kiss and Bichler 2008), identification of communities (Lee The social media is a labyrinth of information which when et al. 2010; Mishra et al. 2007; Zhang and Yu 2015; Duan uncovers, provides a deep insight into the real-world happen- ings. In this study, we use social media Twitter to create user et al. 2014; Gregory 2008; Whang, Gleich, and Dhillon groups or clusters using the retweet and reply directed links. 2016), determination of the geographic location of users The main idea behind creating the groups is to figure out a (by message contents) (Cheng, Caverlee, and Lee 2010; user’s best suited place and to generate crisp clusters. Each Chandra, Khan, and Muhaya 2011) and using user location user forms a group and thus numerous overlapping groups in profile (Hecht et al. 2011), sentiment analysis and opin- or clusters are created. To get crisp clusters, we present an ion mining (Kouloumpis, Wilson, and Moore 2011), deter- algorithm for removing duplicates in cluster configurations mining who is “following” / “friends with” / “connected to” that feature a significant amount of overlapping. The idea pre- whom (Brzozowski and Romero 2011; Kwak et al. 2010), sented in this paper is that we consider numerous overlapping trend identification (Gloor et al. 2009), and “hot spot” de- clusters in a cluster set and proceed in a manner where each tection (Li and Wu 2010) (indicating some natural disaster) cluster is compared with a set of users. The user set is cre- ated from these clusters. The proposed algorithm deletes all (Kryvasheyeu et al. 2016). duplicates and is compared to a naive algorithm. Moreover, a Generally, there are three ways to analyse Twitter data: the modified algorithm is also proposed whereby selected dupli- social network analysis, content analysis and context analy- cates are kept based on most significant position of the user sis. Many works have been carried out using message con- among all clusters in the configuration. This does not guaran- tent while valuable retweet information is neglected (Bild et tee that all duplicates will be removed. But, as shown in the al. 2015). In this paper, we are considering retweet and re- study a majority of duplicates are removed. Both the proposed ply directed links to identify user groupings or clusters. A and modified algorithm are lot faster than the naive one. This reweet is a forwarded message from a user to his follow- domain was selected because its a domain where we wish to identify unique user communities (clusters) and where large ers. This is of interest because it tells us who is connected to amount of overlap typically exists. After duplicate elimina- whom, or in Twitter jargon who is “following” whom. More- tion, we are left with few clusters which are much bigger in over, a user in the Twitter network can retweet any other size than other clusters in the cluster set. user’s tweet and this shows the topical interest of the user who retweets the tweet of another user. This allows us to group (cluster) users, according to whom they are “follow- Introduction ing”, which in turn is of interest with respect to a variety of Social networking sites generate extensive amounts of data. socio-economic applications such as recommending follow- It is generally acknowledged that embedded within this ers, recommending feeds for tweeting etc. However, unlike data there is a lot of useful, domain dependent, knowl- in the case of conventional clustering algorithms, grouping edge (Adedoyin-Olowe, Gaber, and Stahl 2014). The chal- users in this way typically results in numerous overlapping lenge is to identify and extract this knowledge in a man- clusters (groups of users). Individual Twitter users typically ner whereby it can be meaningfully utilized. One popular follow many others, and are typically followed by many oth- mechanism for attempting to do this is to use data min- ers. On an average a Twitter user has 208 followers although ing technology (Srivastava 2008; Jensen and Neville 2003; the variance is considerable1 . Since a user may be following Barbier and Liu 2011). Examples of where data mining numerous other users he may belong to different commu- technology has been applied to social network data in- nities and thus the overlap. Furthermore, Twitter does not clude: content analysis (Naaman, Boase, and Lai 2010; Wu require a user to be a follower of someone to retweet their content and thus this also increases the chance of overlap- Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. ping since a single user can retweet many tweets of other Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learn- 1 ing with Knowledge Engineering (AAAI-MAKE 2019). Stanford Twitter statistics and facts (August 2016), http:// University, Palo Alto, California, USA, March 25-27, 2019. expandedramblings.com/index.php/. users and vice-versa. to get unique groups or communities where overlapping is Overlapping clusters (user groupings) may not always minimized. Our problem is for exact duplicate removal. In be a bad thing; but for many applications, for example so- social media a user has followers and friends. Tweets gener- cial media user segmentation, we wish to identify “crisp” ally flow from a user to the followers and friends. The social clusters, clusters that have a unique membership. More followers graph and other communities using followers and generally, overlapping clusters are undesirable in that they friends are well studied. But the retweet network where there “fade” the dissimilarity (distinctiveness) between clusters. is a directed edge between two users from source to des- The greater the cluster overlap, the more similar the clus- tination, has received not much attention (Bild et al. 2015). ters become, and the differentiation between clusters deteri- Size, noise and dynamism are dominant research issues with orates. The problem is exacerbated when we have, not two social media (Adedoyin-Olowe, Gaber, and Stahl 2014). A or three overlapping clusters, but many hundreds with vary- user may be present in different social groups or communi- ing degrees of overlap (similarity) as in the case of Twitter ties, that makes overlapping clusters. communities. Many works have been carried out to detect community To derive “crisp” clusters from a set of clusters where one clusters in social media (Lee et al. 2010; Mishra et al. 2007; or more of the clusters overlap it is necessary to remove du- Zhang and Yu 2015; Duan et al. 2014; Gregory 2008; plicate members from individual clusters, using some crite- Whang, Gleich, and Dhillon 2016; Goldberg et al. 2010; ria, so that each cluster becomes unique; a process known Arora et al. 2012; Hou et al. 2015; Dreier et al. 2014; as duplicate removal. In this paper, we have proposed an al- Lancichinetti and Fortunato 2009). Social networking com- gorithm to remove all duplicates from clusters. However by munities are highly overlapped as a node is present in more doing so we may be loosing important information. Ideally than one community. The benchmark algorithms to detect duplicate removal should be conducted in such a way that communities work better when overlapping is minimized information is not lost, or at least the loss is minimized. In (Lee et al. 2010). In the paper (Zhang and Yu 2015) the the case where we have many overlapping clusters there is authors detect community for emerging networks using a also a computational overhead involved, thus we wish our closeness measure “intimacy”. In our case, we have clus- duplicate removal to be conducted in such a way that the ter nodes through retweet or reply links. After duplicate re- number of comparisons that need to be made is minimized. moval we get some unique cluster communities. Unique in In this paper, we thus propose a simple, another algorithm the sense is that it does not follow the complete commu- for the effective derivation of crisp clusters from overlap- nity definition (Arora et al. 2012) in the social network. In ping clusters derived from Twitter data using the medium of the paper (Duan et al. 2014), author has used correlation Retweets. In doing so, we are placing users in groups that is analysis to connect to modularity based methods (Shiokawa, best suited by hierarchy using the retweet/reply links. Fujiwara, and Onizuka 2013; Clauset, Newman, and Moore With respect to the work presented in this paper we con- 2004) for community detection. ceptualize Twitter data in terms of a directed graph where Although there are number of works using seed expan- the vertices represent users and the edges retweets or replies sion(Lee et al. 2010; Whang, Gleich, and Dhillon 2016) for from one user to another. Generally, it is assumed that a detecting overlapping communities but there is no clear un- user “retweets” another user if there is something interesting derstanding which technique is most suitable for a partic- (topical) in a received tweet. Clusters representing commu- ular domain (Kloumann and Kleinberg 2014) and the per- nities can then be generated starting with an individual “tar- formance of community assignment algorithms (Lee et al. get” user, vertex in the graph, and proceeding in a breadth 2010). The paper (Lee et al. 2010) introduced a greedy first manner, level-by-level, up to some pre-specified maxim clique expansion algorithm removing near duplicate com- level (distance from start) l. At each level the vertices are munities using distinct cliques as seed. In (Conover et al. added to the clustered representing the target user. In this 2011) the authors have used network of retweets and men- manner a set of clusters, a cluster configuration, can be pro- tion network to find political alignment. Cluster analysis of duced; one cluster for each target user in given set of tweets. these networks reveal clear segregation. Our approach, fo- However, the resulting set of clusters will feature significant cuses on exact duplicate removal in overlapping clusters to overlap which makes interpretation difficult (as discussed get “crisp” cluster communities by finding a suitable posi- above). Note that clustering users using retweets and replies tion of a user in the group. is different from using Follow links; Follow links are his- torical in nature, whilst retweet and reply links are current. Scope of The Work Hence clusters generated using retweet and reply links tend The work presented in this paper is directed at deriving crisp to be much more current (topical) than clusters generated clusters from overlapping clusters by finding a user’s best using Follow links. suited position. Some or many clusters are overlapping de- pending upon the level of hierarchy. Number of overlap- Related Work ping clusters increases as well as the similarity between the Distinguishing overlapping clusters is difficult due to much clusters, by going up the level. At each level different clus- similarity between the clusters. Our work on overlapping ter sizes are chosen by certain threshold. The problem ad- clusters is based on retweet or reply network (Paul, Dutta, dressed here is removal or elimination of duplicates among and Coenen 2016; Lussier and Chawla 2011) of social me- overlapping clusters. The first algorithm deletes all duplicate dia, Twitter. In our case, majority of duplicates are removed users among the clusters. The algorithm gradually creates a set of unique users by comparing a user from this set with we can expect extensive overlap. For the purpose of the pro- another user from the clusters and simultaneously removes posed algorithm all the cluster formation in the cluster set is user from the cluster. The second algorithm is the modifi- used. Moreover, experiments are also performed by select- cation of the first algorithm where selected duplicates with ing the largest clusters (in terms of number of members) de- certain criteria are not removed since removing all dupli- fined using a threshold τ . For a given maximum level l, the cates will eventually means loss of information. The algo- τ value is such adjusted that it gives top 0.25%, 0.5%, 1.0%, rithms are compared with the Naive algorithm with much 2.0%, 4.0% etc clusters of the total number of clusters. The improved time complexity. selected value of τ thus dictates a minimum clusters size be- low which clusters are not considered for duplicate removal. Problem Formulation The clusters are collection of ”users” or ”members”. We will use these words interchangeably. As noted above the overlapping clusters of interest, with re- Problem: Given a set of n overlapping clusters C = spect to the work presented in this paper, are clusters of Twit- {C1 , C2 , C3 , . . . , Cn }, with maximum level l, delete all du- ter users. The clusters are formed using retweet and repliy plicates in clusters. links between users. The links are traversed in breadth first In this paper we propose using an ”empty bucket” cluster search manner. The sideway links within same layer or level and set of n overlapping clusters. Initially ”Empty bucket” are not taken. Given a retweet graph G = {V, E} where V is empty. Starting from the first cluster in the set the mem- is the vertex or user node and E is the directional edge. A bers are compared to the ”empty bucket” which is gradually user Ui is connected to another user Uj if Uj has retweeted populated by the members from the clusters. The common or replied to user Ui , note that the relationship is unidirec- or duplicate users are deleted from the clusters and the non- tional (as opposed to bidirectional). So, there is an edge E duplicate members are added to the ”empty bucket”. The between Uj to Ui . Thus, starting from a given user we can ”empty bucket” will contain only unique members. place this user and all its immediate neighbours into a sin- Given C = {C1 , C2 , C3 , C4 , . . . , Cn } and E = {} where gle cluster (where a neighbouring users is one connected di- Ci ={U1 , U2 , U3 , U4 , . . . , Um }. If Ci ∩ E = Cs . The dupli- rectly to the current user by a retweet or reply). If two users cates in Cs are deleted from the cluster. If Ci - E = Cu . The are “connected” they are in the same cluster. We can then uncommon members Cu are added to the ”empty bucket”. proceed to the immediate neighbours of the seed user plus If Ci ∩ E = φ. The cluster members are added to the empty one, and so on to some predefined maximum “level” l. If we bucket. assume a set of m Twitter users U = {U1 , U2 , U3 , .., Um } = Problem: Given a set of n overlapping clusters C = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} and the following set of {C1 , C2 , C3 , C4 , . . . , Cn }, with maximum level l, delete connections {11 → 1, 8 → 1, 8 → 2, 9 → 2, 2 → 11, 3 → least significant duplicate users. 11, 6 → 8, 12 → 8, 10 → 9, 2 → 5, 5 → 10} (where Select a level l and τ to adjust the number of top Uj → Ui indicates a Retweet/Reply from user Uj to user clusters. Given an empty bucket E = {} and C = Ui ); then we would get clusters of the form shown in Fig- {C1 , C2 , C3 , C4 , . . . , Cn }. The first step is to populate the ure 1 assuming l = 2 (the root is at level 0, the “base level”). bucket with most significant user Uk . By doing so the algo- The figure shows three clusters with respect to users 1, 2 and rithm reads all the clusters once. Suppose users in clusters 10. The clusters in this case are C1 = {1, 2, 3, 6, 8, 11, 12}, is given by Ui and users in empty bucket is Ue . The empty C2 = {2, 6, 8, 9, 10, 12} and C10 = {2, 5, 10}. A user ap- bucket is filled in the following fashion. pears in a cluster only once; there is no duplicity within a 1. If Ui = Ue and Ui (level) < Ue (level). Replace Ue by Ui . cluster. Each user is allowed to form a cluster and is called ”root user”. 2. If Ui 6= Ue . Put Ui in E. 3. If E ={}. Put Ui in E 1 2 10 In the second step, the bucket with most significant users are compared with all the clusters once. The duplicates are 11 8 8 9 5 deleted from the clusters in the following manner. 1. If Ui = Ue and Ui (level) > Ue (level) and Ui (level) 6= 0. Delete Ui from the cluster. 2 3 6 12 6 12 10 2 2. If Ui = Ue and Ui (level) = Ue (level). Set Ue (level) = −1. (a) C1 (b) C2 (c) C3 To make sure that all the duplicates with this condition is deleted except one. Figure 1: Example clusters for users The Proposed Algorithm From the above simple example we can see a substan- In this section the proposed duplicate removal algorithm tial overlap. Note that each user in each cluster is marked is presented. Recall, with respect to the forging, that using with its “level of appearance” (neighbourhood level). Where some maximum level l we generate a cluster set C describ- a user has several level associated with it the level nearest ing social media (Twitter) users. The set C will include one the root is chosen (the closer to the root the more significant cluster per user and thus feature numerous overlapping clus- a user is deemed to be). Thus, given a real Twitter data set, ters. Only those users who have got even a single retweet or reply message are selected to create cluster. Others are ig- Algorithm 2 Bucket with Most Significant Users nored. Each cluster member (user) is associated with a level INPUT: A Cluster set C, generated using max level l, and of appearance. If a user has several levels associated with it pruned using τ and an empty bucket set E the level nearest to root (target user) will be used. We have OUTPUT: A Bucket set E 0 with most significant users Ue experimented with all the selected users that form clusters. Yet, we have shown the use of threshold τ . In case we have 1: for each user Ui in cluster Ci do even larger data, τ can be used. In that case only the clusters 2: for each user Ue in Bucket do who’s size (in terms of number of members) as defined by 3: if Bucket is Empty then the threshold τ are selected. Most of the clusters are over- 4: put Ui in E lapping. 5: else The pseudo for the first algorithm is given in Algorithm 1. 6: if Ui == Ue and Ui hlevel i < Ue hlevel i The input is a set of clusters C, generated using some max then level l and pruned using the threshold τ . The output is the 7: Replace Ue by Ui cluster set C 0 with all duplicates removed. 8: else 9: put Ui in E Algorithm 1 Delete User Without Condition 10: end if INPUT: A Cluster set C, generated using max level l, and 11: end if pruned using τ and an empty bucket set 12: end for OUTPUT: The Cluster set C 0 with all duplicates removed 13: end for 14: Return E 0 1: for each user Ui in cluster Ci do 2: for each user Ue in Bucket do 3: if Bucket is Empty then appearing further away from the root. This, E 0 is the set of 4: put Ui in Bucket distinct users with its most significant position. The level of 5: else a user is considered for comparing significance. The clusters 6: if Ui == Ue then are traversed only once. Initially the bucket set E is empty. 7: Delete user Ui in Cluster Ci When the algorithm reads the first user in the first cluster, 8: else the bucket is populated. After that, one by one all the users 9: Put Ui in Bucket in all the clusters are read. The user Ui in clusters is com- 10: end if pared to Ue of the bucket set E with their level (position 11: end if of appearance from the root). The nearest user is the user 12: end for which is closer to the root. In the algorithm 2 line 6:9 shows 13: end for the comparison. The user Ui with the less value i.e nearest 14: Return C 0 to the root replaces user Ue from the bucket. The algorithm continues till all the clusters are read. The above algorithm given in Algorithm 1 deletes all the duplicates in the clusters by populating an empty bucket Duplicate Removal With Condition and comparing users in clusters with the bucket users. The The output from the Algorithm 2 is input for the third al- bucket size is the total number of distinct users in the cluster gorithm. Each user Ui in the cluster set is compared to the set. Here, the bucket uses only the user and not the level of bucket set E 0 user Ue . All the users those are least signifi- its appearance. In this algorithm the initial generated clusters cant and level 6= 0 are deleted from the clusters. If a user will be bigger in size than the later ones because initially the is present in both the cluster and the bucket set with same bucket is empty. Nevertheless all the duplicates are removed level then the user is kept in at least one cluster. All other from the cluster set but with a cost. The level of appearance duplicates are deleted. of a user is not used and thus the information in the clusters will be less. The next algorithm is the modification of the above Algo- Evaluation rithm 1 which has two parts: These are discussed in further For the evaluation presented in this section the Geo-tagged detail in the following two subsections, Sub-sections and . Microblog data set2 available from the ARK data repository held at the University of Washington was used. The dataset Generating Most Significant User Bucket holds 377616 Tweets covering all US states and the Dis- This sub-section generates a set of most significant user trict of Columbia (Eisenstein et al. 2010). The dataset fea- bucket E 0 . A user Ui is more significant if it appears near ture 9477 users. From this data set four cluster sets were to the root than the user further away from the root. generated using a range of values for l, the maximum dis- The above Algorithm 2 generates a set of users E 0 which tance from the root, {6, 7, 8, 9}. The users who did not get are most significant in nature. E 0 contains users with its most any retweet or replied messages from any other user or significant position given by level. An user appearing clo- 2 sure to the root is considered more significant than a user http://www.ark.cs.cmu.edu/GeoTwitter. Algorithm 3 Duplicate Removal the comparison of run time of the proposed and modified INPUT: A Cluster set C, generated using max level l, and algorithm with the naive algorithm. pruned using τ and Bucket set E 0 Adding further to the evaluation process, τ is used for OUTPUT:: The set C 0 with most duplicates removed different levels {6, 7, 8, 9}. To explore how the threshold τ effects the process we considered setting τ to a range of 1: for each user Ui in cluster Ci do values in terms of the percentage of top clusters to be re- 2: for each user Ue in Bucket do tained {0.25%, 0.5%, 0.75%, 1.0%, 2.0%} in the cluster set. 3: if Ui == Ue and Ui hlevel i > Ue hlevel i then Table 5 shows the result of different τ values for level 9 only. 4: Delete Ui from cluster Figure 3 shows the run time of naive, proposed and modified 5: if Ui == Ue and Ui hlevel i == Ue hlevel i algorithm by setting different τ values. In level 9 there are then total 7123 clusters. τ is set such that we get certain percent- 6: Set Ue hlevel i == −1 age of top clusters. Thus in the column ”Number of Clus- 7: end if ters”, is the top clusters, each with minimum size greater 8: end if than the numbers mentioned in the column ”Minimum Size 9: end for of Clusters”. 10: end for 11: Return update C 0 self within that time period are not considered for cluster- ing.This produced cluster sets comprised of 7123 clusters (|C| = 7123) respectively. Since there are total 9477 users and the generation of clusters is around 7123, the remaining 2354 are single users with no retweet or reply messages from anyone or self. 7123 number of clusters also contains single user cluster like users who have retweeted themselves.These are 2262 single user clusters out of 7123 clusters. Further, in 7123 clusters total distinct users are 7576 in number. Thus, Figure 2: Comparison of runtime of naive algorithm with 1901 users neither received any retweet or reply message nor proposed and modified algorithm for different levels= they have sent any in the same time period to other users. 6, 7, 8, 9. τ is set to 100% As l increases the average number of members per cluster in the four different cluster sets also increases, and conse- quently the clusters become more diverse but feature greater numbers of duplicates. To analyze the operation of the proposed algorithm we generated cluster sets with different values for l={6, 7, 8, 9}. Total number of clusters generated is 7123 and the total number of distinct users for all levels is 7576. The results are presented in Table 1, Table 2, Table 3 and Table 4. In the tables the “Num. of Clusters” column indicates the num- ber of clusters retained after application of the τ threshold value. Here τ is set to 100% to generate clusters with min- imum size of one user. The “Num. of Distinct Users” col- umn gives the number of distinct users in the retained clus- ter set. This is also the bucket set generated. The following Figure 3: Comparison of runtime of naive algorithm with two columns give the number of duplicates before and after proposed and modified algorithm at level 9 with different τ the proposed duplicate removal process was applied, and the values last column compares the run time of Naive algorithm with proposed and modified algorithm. From, the tables it can be seen that in all cases the proposed algorithm eliminates all the duplicates featured in each of the cluster sets. To high- Analysis and Observation light the advantages that can be gained using the proposed The challenge of duplicate removal in large cluster configu- approach its operation was compared with a naive approach rations that feature a significant amount of overlap, as in the where we compare every cluster in the cluster set C with ev- case of user communities extracted from social media net- ery other cluster in C and remove all duplicates. Moreover, works (such as Twitter), is the resource required to remove the modified algorithm retains certain duplicates and its run- all duplicates. Furthermore, it becomes more complicated if time is also compared. Number of duplicates retain in the selected duplicates are to be retained due to its properties. modified algorithm is shown in the tables. Figure 2 shows In the proposed algorithm a cluster set is read only once and Algorithms Num. Num of Num. Duplicates Num. Duplicates Percentage of Run Time in of Clusters Distinct Users Before Elimination After Elimination Duplicates Retained Seconds Naive 7123 7576 165185 0 0.0 308.593 Proposed 7123 7576 165185 0 0.0 97.593 Modified 7123 7576 165185 8638 5.22 150.766 Table 1: Runtime compared between Naive, Proposed and Modified Algorithms where l = 6 and minimum size of each cluster is 1, τ = 100% Algorithms Num. Num of Num. Duplicates Num. Duplicates Percentage of Run Time in of Clusters Distinct Users Before Elimination After Elimination Duplicates Retained Seconds Naive 7123 7576 246524 0 0.0 451.328 Proposed 7123 7576 246524 0 0.0 139.297 Modified 7123 7576 246524 9152 3.71 217.938 Table 2: Runtime compared between Naive, Proposed and Modified Algorithms where l = 7 and minimum size of each cluster is 1, τ = 100% Algorithms Num. Num of Num. Duplicates Num. Duplicates Percentage of Run Time in of Clusters Distinct Users Before Elimination After Elimination Duplicates Retained Seconds Naive 7123 7576 359383 0 0.0 672.281 Proposed 7123 7576 359383 0 0.0 202.313 Modified 7123 7576 359383 9588 2.66 310.383 Table 3: Runtime compared between Naive, Proposed and Modified Algorithms where l = 8 and minimum size of each cluster is 1, τ = 100% Algorithms Num. Num of Num. Duplicates Num. Duplicates Percentage of Run Time in of Clusters Distinct Users Before Elimination After Elimination Duplicates Retained Seconds Naive 7123 7576 510768 0 0.0 1019.141 Proposed 7123 7576 510768 0 0.0 302.985 Modified 7123 7576 510768 9898 1.93 453.328 Table 4: Runtime compared between Naive, Proposed and Modified Algorithms where l = 9 and minimum size of each cluster is 1, τ = 100% Number Minimum Num of Number of Number of Number of RunTime RunTime RunTime of Size of τ Distinct Duplicates Duplicates After Duplicates After in in in Clusters clusters Users before Elimination Elimination Seconds Seconds Seconds Elimination Naive, Proposed (Modified) Naive Proposed Modified 18 750 0.25% 1687 12847 0,0 719 2.938 2.015 2.578 36 700 0.5% 1846 25751 0,0 1359 6.594 4.484 5.172 71 661 1.0% 1900 49434 0,0 1504 15.063 8.281 10.109 142 560 2.0% 2006 92265 0,0 3790 33.0 17.297 20.140 283 445 4.0% 2196 162289 0,0 3496 75.563 34.000 35.828 355 400 5.0% 2263 192506 0,0 3849 97.422 38.672 46.547 427 371 6.0% 2336 220116 0,0 3967 119.328 45.265 49.859 499 342 7.0% 2422 245718 0,0 4186 144.797 49.875 59.390 571 312 8.0% 2472 269115 0,0 4292 171.89 56.907 67.344 714 268 10.0% 2562 310420 0,0 4203 203.563 67.406 78.719 Table 5: Runtime compared between Naive, Proposed and Modified Algorithms where l = 9 and minimum size of each cluster is set using τ Level Cluster Size Cluster Size Cluster Size Cluster Size Less Than 50 Between 50 and 100 Between 100 and 200 Above 200 6 7108 11 04 0 7 7103 13 06 01 8 7098 15 07 03 9 7094 19 07 03 Table 6: Cluster Size After Elimination all the duplicates are removed. The proposed algorithm does ters represent active users in the cluster configuration if we not take care of the appearance of a user or member by level. do not consider the clusters with one member only. Since, a cluster is generated for each user starting from the In the future, there are few inroads that open up with this root, it is better to keep the root of the cluster. Moreover, our study. The bucket set can be converted into knowledge set of intuition say that certain duplicates will enrich the clusters. individual distinct users which can learn by reading differ- To accomplish this, we modified the proposed algorithm to ent clusters. Thus, the properties of users will be enhanced keep selected duplicates. A user closer to the root user is and also the clusters. The use of τ can be used for much big- more similar to it. The modified algorithm reads the clus- ger data set for close approximation. Another future work of ter set two times. First, the algorithm reads the cluster set to the study is to investigate how physical distance matters be- generate a set of users by level and in the second pass, all the tween users i.e users who are retweeting the post of another clusters are read and compared to the set of users by selec- user. Furthermore, a more detailed study of most prominent tion conditions. This set consists of most significant distinct clusters in the cluster set will open up new avenues. user. Those members fail the conditions are deleted. From Table 1, Table 2, Table 3 and Table 4 it is observed that as the Acknowledgement level l increases the number of duplicates also increases but the percentage of duplicates deleted decreases drastically as This research work is partially supported by the Visves- shown in column ”Percentage of Duplicates Retained”. For varaya PhD scheme, Ministry of Electronics and Informa- level 6 the percentage of duplicates retained is 5.22% where tion Technology, Government of India. as for the level 9 the figure is 1.93%. In the Table 5, top 10% of the clusters gives us 33% of References total users in the cluster set which is around 7576. Moreover, Adedoyin-Olowe, M.; Gaber, M. M.; and Stahl, F. 2014. A after duplicate removal majority of clusters left is of size less survey of data mining techniques for social media analysis. than 50. For example, at level 6, out of 7123 clusters only Journal of Data Mining & Digital Humanities 2014. 15 clusters are of size more than 50. If we bifurcate further then in this only 4 clusters are there which go beyond 100 Arora, S.; Ge, R.; Sachdeva, S.; and Schoenebeck, G. 2012. in size. This is listed in the Table 6 for other levels. Thus, Finding overlapping communities in social networks: To- the root users in the big clusters are those who got retweet ward a rigorous approach. In Proceedings of the 13th ACM and reply messages from maximum other users directly or Conference on Electronic Commerce, EC ’12, 37–54. New through chain of other users. We can see these users as the York, NY, USA: ACM. most prominent users in the cluster set. In general, an in- Barbier, G., and Liu, H. 2011. Data mining in social media. fluencer spreads message to maximum members. But in our In Social network data analytics. Springer. 327–352. case, the root users are those who got maximum retweets Bild, D. R.; Liu, Y.; Dick, R. P.; Mao, Z. M.; and Wallach, or reply messages. Thus “crisp” clusters are generated for a D. S. 2015. Aggregate characterization of user behavior cluster set where overlapping is minimized. in twitter and analysis of the retweet graph. ACM Trans. Internet Technol. 15(1):4:1–4:24. Conclusion Brzozowski, M. J., and Romero, D. M. 2011. Who should Here, we demonstrated methods to eliminate duplicates i follow? recommending people in directed social networks. among numerous overlapping clusters formed using retweet ICWSM. and reply message links among users using Twitter data. The retweet and reply network among users are highly over- Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. lapping. The overlapping clusters fade dissimilarity. To get 2010. Measuring user influence in twitter: The million fol- some good clusters or dissimilar clusters to lessen the sim- lower fallacy. In 4th International AAAI Conference on We- ilarity, elimination of duplicates is necessitated. Our meth- blogs and Social Media (ICWSM). ods work much better than the naive algorithm. Moreover, Chandra, S.; Khan, L.; and Muhaya, F. B. 2011. Estimat- using this method we can selectively delete duplicates much ing twitter user location using social interactions–a content faster. This study also shows the generation of ”crisp” clus- based approach. In 2011 IEEE Third International Confer- ters and among them a few prominent clusters, that is, the ence on Privacy, Security, Risk and Trust and 2011 IEEE clusters which are much bigger than other clusters in the set Third International Conference on Social Computing, 838– after duplicate elimination. The retweet/reply network clus- 843. Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where Knowledge Discovery and Data Mining, KDD ’14, 1366– you tweet: A content-based approach to geo-locating twitter 1375. New York, NY, USA: ACM. users. In Proceedings of the 19th ACM International Con- Kouloumpis, E.; Wilson, T.; and Moore, J. 2011. Twit- ference on Information and Knowledge Management, num- ter sentiment analysis: The good the bad and the omg! In ber 10 in CIKM ’10, 759–768. New York, NY, USA: ACM. ICWSM. Clauset, A.; Newman, M. E.; and Moore, C. 2004. Finding Kryvasheyeu, Y.; Chen, H.; Obradovich, N.; Moro, E.; community structure in very large networks. Physical review Van Hentenryck, P.; Fowler, J.; and Cebrian, M. 2016. Rapid E 70(6):066111. assessment of disaster damage using social media activity. Conover, M.; Ratkiewicz, J.; Francisco, M.; Goncalves, B.; Science Advances 2(3). Menczer, F.; and Flammini, A. 2011. Political polarization Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is on twitter. AAAI. twitter, a social network or a news media? In Proceedings Dreier, J.; Kuinke, P.; Przybylski, R.; Reidl, F.; Rossmanith, of the 19th International Conference on World Wide Web, P.; and Sikdar, S. 2014. Overlapping communities in social WWW ’10, 591–600. New York, NY, USA: ACM. networks. CoRR abs/1412.4973. Lancichinetti, A., and Fortunato, S. 2009. Benchmarks Duan, L.; Street, W. N.; Liu, Y.; and Lu, H. 2014. Com- for testing community detection algorithms on directed and munity detection in graphs through correlation. In Proceed- weighted graphs with overlapping communities. Phys. Rev. ings of the 20th ACM SIGKDD International Conference on E 80(1):016118. Knowledge Discovery and Data Mining, KDD ’14, 1376– Lee, C.; Reid, F.; McDaid, A.; and Hurley, N. 2010. De- 1385. New York, NY, USA: ACM. tecting highly overlapping community structure by greedy clique expansion. ArXiv e-prints. Eisenstein, J.; O’Connor, B.; Smith, N. A.; and Xing, E. P. 2010. A latent variable model for geographic lexical vari- Li, N., and Wu, D. D. 2010. Using text mining and sentiment ation. In Proceedings of the 2010 Conference on EMNLP, analysis for online forums hotspot detection and forecast. 1277–1287. Association for Computational Linguistics. Decis. Support Syst. 48(2):354–368. Gloor, P. A.; Krauss, J.; Nann, S.; Fischbach, K.; and Lussier, J. T., and Chawla, N. V. 2011. Network effects on Schoder, D. 2009. Web science 2.0: Identifying trends tweeting. In Proceedings of the 14th International Confer- through semantic social network analysis. In 2009 Interna- ence on Discovery Science, DS’11, 209–220. Berlin, Hei- tional Conference on Computational Science and Engineer- delberg: Springer-Verlag. ing, volume 4, 215–222. Mishra, N.; Schreiber, R.; Stanton, I.; and Tarjan, R. E. 2007. Clustering social networks. In Proceedings of the 5th In- Goldberg, M.; Kelley, S.; Magdon-Ismail, M.; Mertsalov, ternational Conference on Algorithms and Models for the K.; and Wallace, A. 2010. Finding overlapping communi- Web-graph, WAW’07, 56–67. Berlin, Heidelberg: Springer- ties in social networks. In 2010 IEEE Second International Verlag. Conference on Social Computing, 104–113. Naaman, M.; Boase, J.; and Lai, C.-H. 2010. Is it really Gregory, S. 2008. A fast algorithm to find overlapping com- about me?: Message content in social awareness streams. munities in networks. In Machine learning and knowledge In Proceedings of the 2010 ACM Conference on Computer discovery in databases. Springer. 408–423. Supported Cooperative Work, CSCW ’10, 189–192. New Hecht, B.; Hong, L.; Suh, B.; and Chi, E. H. 2011. Tweets York, NY, USA: ACM. from justin bieber’s heart: The dynamics of the location field Paul, A.; Dutta, A.; and Coenen, F. 2016. Cluster of tweet in user profiles. In Proceedings of the SIGCHI Conference users based on optimal set. In 2016 IEEE Region 10 Con- on Human Factors in Computing Systems, CHI ’11, 237– ference (TENCON), 286–290. 246. New York, NY, USA: ACM. Shiokawa, H.; Fujiwara, Y.; and Onizuka, M. 2013. Fast Hou, Y.; Whang, J. J.; Gleich, D. F.; and Dhillon, I. S. algorithm for modularity-based graph clustering. In AAAI, 2015. Non-exhaustive, overlapping clustering via low-rank 1170–1176. semidefinite programming. In Proceedings of the 21th ACM Srivastava, J. 2008. Data mining for social network analysis. SIGKDD International Conference on Knowledge Discov- In 2008 IEEE International Conference on Intelligence and ery and Data Mining, KDD ’15, 427–436. New York, NY, Security Informatics, xxxiii–xxxiv. USA: ACM. Whang, J. J.; Gleich, D. F.; and Dhillon, I. S. 2016. Overlap- Jensen, D., and Neville, J. 2003. Data mining in social ping community detection using neighborhood-inflated seed networks. 287–302. In Dynamic Social Network Modeling expansion. IEEE Transactions on Knowledge and Data En- and Analysis: Workshop Summary and Papers. gineering 28(5):1272–1284. Kiss, C., and Bichler, M. 2008. Identification of influencers Wu, S.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011. - measuring influence in customer networks. Decis. Support Who says what to whom on twitter. WWW ’11, 705–714. Syst. 46(1):233–253. New York, NY, USA: ACM. Kloumann, I. M., and Kleinberg, J. M. 2014. Community Zhang, J., and Yu, P. S. 2015. Community detection for membership identification from small seed sets. In Proceed- emerging networks. In Proceedings of the 2015 SIAM Inter- ings of the 20th ACM SIGKDD International Conference on national Conference on Data Mining, 127–135. SIAM.