Duplicate Removal for Overlapping Clusters: A Study Using Social Media Data

                                                 Amit Paul and Animesh Dutta
               Department of Computer Science and Engineering, NIT Durgapur, India, amitpaul06@gmail.com
               Department of Computer Science and Engineering, NIT Durgapur, India, animeshnit@gmail.com


                            Abstract                                 et al. 2011), identification of influencers (Cha et al. 2010;
                                                                     Kiss and Bichler 2008), identification of communities (Lee
  The social media is a labyrinth of information which when
                                                                     et al. 2010; Mishra et al. 2007; Zhang and Yu 2015; Duan
  uncovers, provides a deep insight into the real-world happen-
  ings. In this study, we use social media Twitter to create user    et al. 2014; Gregory 2008; Whang, Gleich, and Dhillon
  groups or clusters using the retweet and reply directed links.     2016), determination of the geographic location of users
  The main idea behind creating the groups is to figure out a        (by message contents) (Cheng, Caverlee, and Lee 2010;
  user’s best suited place and to generate crisp clusters. Each      Chandra, Khan, and Muhaya 2011) and using user location
  user forms a group and thus numerous overlapping groups            in profile (Hecht et al. 2011), sentiment analysis and opin-
  or clusters are created. To get crisp clusters, we present an      ion mining (Kouloumpis, Wilson, and Moore 2011), deter-
  algorithm for removing duplicates in cluster configurations        mining who is “following” / “friends with” / “connected to”
  that feature a significant amount of overlapping. The idea pre-    whom (Brzozowski and Romero 2011; Kwak et al. 2010),
  sented in this paper is that we consider numerous overlapping      trend identification (Gloor et al. 2009), and “hot spot” de-
  clusters in a cluster set and proceed in a manner where each
                                                                     tection (Li and Wu 2010) (indicating some natural disaster)
  cluster is compared with a set of users. The user set is cre-
  ated from these clusters. The proposed algorithm deletes all       (Kryvasheyeu et al. 2016).
  duplicates and is compared to a naive algorithm. Moreover, a          Generally, there are three ways to analyse Twitter data: the
  modified algorithm is also proposed whereby selected dupli-        social network analysis, content analysis and context analy-
  cates are kept based on most significant position of the user      sis. Many works have been carried out using message con-
  among all clusters in the configuration. This does not guaran-     tent while valuable retweet information is neglected (Bild et
  tee that all duplicates will be removed. But, as shown in the      al. 2015). In this paper, we are considering retweet and re-
  study a majority of duplicates are removed. Both the proposed      ply directed links to identify user groupings or clusters. A
  and modified algorithm are lot faster than the naive one. This     reweet is a forwarded message from a user to his follow-
  domain was selected because its a domain where we wish to
  identify unique user communities (clusters) and where large
                                                                     ers. This is of interest because it tells us who is connected to
  amount of overlap typically exists. After duplicate elimina-       whom, or in Twitter jargon who is “following” whom. More-
  tion, we are left with few clusters which are much bigger in       over, a user in the Twitter network can retweet any other
  size than other clusters in the cluster set.                       user’s tweet and this shows the topical interest of the user
                                                                     who retweets the tweet of another user. This allows us to
                                                                     group (cluster) users, according to whom they are “follow-
                        Introduction                                 ing”, which in turn is of interest with respect to a variety of
Social networking sites generate extensive amounts of data.          socio-economic applications such as recommending follow-
It is generally acknowledged that embedded within this               ers, recommending feeds for tweeting etc. However, unlike
data there is a lot of useful, domain dependent, knowl-              in the case of conventional clustering algorithms, grouping
edge (Adedoyin-Olowe, Gaber, and Stahl 2014). The chal-              users in this way typically results in numerous overlapping
lenge is to identify and extract this knowledge in a man-            clusters (groups of users). Individual Twitter users typically
ner whereby it can be meaningfully utilized. One popular             follow many others, and are typically followed by many oth-
mechanism for attempting to do this is to use data min-              ers. On an average a Twitter user has 208 followers although
ing technology (Srivastava 2008; Jensen and Neville 2003;            the variance is considerable1 . Since a user may be following
Barbier and Liu 2011). Examples of where data mining                 numerous other users he may belong to different commu-
technology has been applied to social network data in-               nities and thus the overlap. Furthermore, Twitter does not
clude: content analysis (Naaman, Boase, and Lai 2010; Wu             require a user to be a follower of someone to retweet their
                                                                     content and thus this also increases the chance of overlap-
Copyright held by the author(s). In A. Martin, K. Hinkelmann, A.     ping since a single user can retweet many tweets of other
Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of
the AAAI 2019 Spring Symposium on Combining Machine Learn-
                                                                       1
ing with Knowledge Engineering (AAAI-MAKE 2019). Stanford                Twitter statistics and facts (August 2016), http://
University, Palo Alto, California, USA, March 25-27, 2019.           expandedramblings.com/index.php/.
users and vice-versa.                                             to get unique groups or communities where overlapping is
   Overlapping clusters (user groupings) may not always           minimized. Our problem is for exact duplicate removal. In
be a bad thing; but for many applications, for example so-        social media a user has followers and friends. Tweets gener-
cial media user segmentation, we wish to identify “crisp”         ally flow from a user to the followers and friends. The social
clusters, clusters that have a unique membership. More            followers graph and other communities using followers and
generally, overlapping clusters are undesirable in that they      friends are well studied. But the retweet network where there
“fade” the dissimilarity (distinctiveness) between clusters.      is a directed edge between two users from source to des-
The greater the cluster overlap, the more similar the clus-       tination, has received not much attention (Bild et al. 2015).
ters become, and the differentiation between clusters deteri-     Size, noise and dynamism are dominant research issues with
orates. The problem is exacerbated when we have, not two          social media (Adedoyin-Olowe, Gaber, and Stahl 2014). A
or three overlapping clusters, but many hundreds with vary-       user may be present in different social groups or communi-
ing degrees of overlap (similarity) as in the case of Twitter     ties, that makes overlapping clusters.
communities.                                                         Many works have been carried out to detect community
   To derive “crisp” clusters from a set of clusters where one    clusters in social media (Lee et al. 2010; Mishra et al. 2007;
or more of the clusters overlap it is necessary to remove du-     Zhang and Yu 2015; Duan et al. 2014; Gregory 2008;
plicate members from individual clusters, using some crite-       Whang, Gleich, and Dhillon 2016; Goldberg et al. 2010;
ria, so that each cluster becomes unique; a process known         Arora et al. 2012; Hou et al. 2015; Dreier et al. 2014;
as duplicate removal. In this paper, we have proposed an al-      Lancichinetti and Fortunato 2009). Social networking com-
gorithm to remove all duplicates from clusters. However by        munities are highly overlapped as a node is present in more
doing so we may be loosing important information. Ideally         than one community. The benchmark algorithms to detect
duplicate removal should be conducted in such a way that          communities work better when overlapping is minimized
information is not lost, or at least the loss is minimized. In    (Lee et al. 2010). In the paper (Zhang and Yu 2015) the
the case where we have many overlapping clusters there is         authors detect community for emerging networks using a
also a computational overhead involved, thus we wish our          closeness measure “intimacy”. In our case, we have clus-
duplicate removal to be conducted in such a way that the          ter nodes through retweet or reply links. After duplicate re-
number of comparisons that need to be made is minimized.          moval we get some unique cluster communities. Unique in
In this paper, we thus propose a simple, another algorithm        the sense is that it does not follow the complete commu-
for the effective derivation of crisp clusters from overlap-      nity definition (Arora et al. 2012) in the social network. In
ping clusters derived from Twitter data using the medium of       the paper (Duan et al. 2014), author has used correlation
Retweets. In doing so, we are placing users in groups that is     analysis to connect to modularity based methods (Shiokawa,
best suited by hierarchy using the retweet/reply links.           Fujiwara, and Onizuka 2013; Clauset, Newman, and Moore
   With respect to the work presented in this paper we con-       2004) for community detection.
ceptualize Twitter data in terms of a directed graph where           Although there are number of works using seed expan-
the vertices represent users and the edges retweets or replies    sion(Lee et al. 2010; Whang, Gleich, and Dhillon 2016) for
from one user to another. Generally, it is assumed that a         detecting overlapping communities but there is no clear un-
user “retweets” another user if there is something interesting    derstanding which technique is most suitable for a partic-
(topical) in a received tweet. Clusters representing commu-       ular domain (Kloumann and Kleinberg 2014) and the per-
nities can then be generated starting with an individual “tar-    formance of community assignment algorithms (Lee et al.
get” user, vertex in the graph, and proceeding in a breadth       2010). The paper (Lee et al. 2010) introduced a greedy
first manner, level-by-level, up to some pre-specified maxim      clique expansion algorithm removing near duplicate com-
level (distance from start) l. At each level the vertices are     munities using distinct cliques as seed. In (Conover et al.
added to the clustered representing the target user. In this      2011) the authors have used network of retweets and men-
manner a set of clusters, a cluster configuration, can be pro-    tion network to find political alignment. Cluster analysis of
duced; one cluster for each target user in given set of tweets.   these networks reveal clear segregation. Our approach, fo-
However, the resulting set of clusters will feature significant   cuses on exact duplicate removal in overlapping clusters to
overlap which makes interpretation difficult (as discussed        get “crisp” cluster communities by finding a suitable posi-
above). Note that clustering users using retweets and replies     tion of a user in the group.
is different from using Follow links; Follow links are his-
torical in nature, whilst retweet and reply links are current.                      Scope of The Work
Hence clusters generated using retweet and reply links tend       The work presented in this paper is directed at deriving crisp
to be much more current (topical) than clusters generated         clusters from overlapping clusters by finding a user’s best
using Follow links.                                               suited position. Some or many clusters are overlapping de-
                                                                  pending upon the level of hierarchy. Number of overlap-
                      Related Work                                ping clusters increases as well as the similarity between the
Distinguishing overlapping clusters is difficult due to much      clusters, by going up the level. At each level different clus-
similarity between the clusters. Our work on overlapping          ter sizes are chosen by certain threshold. The problem ad-
clusters is based on retweet or reply network (Paul, Dutta,       dressed here is removal or elimination of duplicates among
and Coenen 2016; Lussier and Chawla 2011) of social me-           overlapping clusters. The first algorithm deletes all duplicate
dia, Twitter. In our case, majority of duplicates are removed     users among the clusters. The algorithm gradually creates a
set of unique users by comparing a user from this set with          we can expect extensive overlap. For the purpose of the pro-
another user from the clusters and simultaneously removes           posed algorithm all the cluster formation in the cluster set is
user from the cluster. The second algorithm is the modifi-          used. Moreover, experiments are also performed by select-
cation of the first algorithm where selected duplicates with        ing the largest clusters (in terms of number of members) de-
certain criteria are not removed since removing all dupli-          fined using a threshold τ . For a given maximum level l, the
cates will eventually means loss of information. The algo-          τ value is such adjusted that it gives top 0.25%, 0.5%, 1.0%,
rithms are compared with the Naive algorithm with much              2.0%, 4.0% etc clusters of the total number of clusters. The
improved time complexity.                                           selected value of τ thus dictates a minimum clusters size be-
                                                                    low which clusters are not considered for duplicate removal.
                  Problem Formulation                               The clusters are collection of ”users” or ”members”. We will
                                                                    use these words interchangeably.
As noted above the overlapping clusters of interest, with re-          Problem: Given a set of n overlapping clusters C =
spect to the work presented in this paper, are clusters of Twit-    {C1 , C2 , C3 , . . . , Cn }, with maximum level l, delete all du-
ter users. The clusters are formed using retweet and repliy         plicates in clusters.
links between users. The links are traversed in breadth first          In this paper we propose using an ”empty bucket” cluster
search manner. The sideway links within same layer or level         and set of n overlapping clusters. Initially ”Empty bucket”
are not taken. Given a retweet graph G = {V, E} where V             is empty. Starting from the first cluster in the set the mem-
is the vertex or user node and E is the directional edge. A         bers are compared to the ”empty bucket” which is gradually
user Ui is connected to another user Uj if Uj has retweeted         populated by the members from the clusters. The common
or replied to user Ui , note that the relationship is unidirec-     or duplicate users are deleted from the clusters and the non-
tional (as opposed to bidirectional). So, there is an edge E        duplicate members are added to the ”empty bucket”. The
between Uj to Ui . Thus, starting from a given user we can          ”empty bucket” will contain only unique members.
place this user and all its immediate neighbours into a sin-           Given C = {C1 , C2 , C3 , C4 , . . . , Cn } and E = {} where
gle cluster (where a neighbouring users is one connected di-        Ci ={U1 , U2 , U3 , U4 , . . . , Um }. If Ci ∩ E = Cs . The dupli-
rectly to the current user by a retweet or reply). If two users     cates in Cs are deleted from the cluster. If Ci - E = Cu . The
are “connected” they are in the same cluster. We can then           uncommon members Cu are added to the ”empty bucket”.
proceed to the immediate neighbours of the seed user plus           If Ci ∩ E = φ. The cluster members are added to the empty
one, and so on to some predefined maximum “level” l. If we          bucket.
assume a set of m Twitter users U = {U1 , U2 , U3 , .., Um } =         Problem: Given a set of n overlapping clusters C =
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} and the following set of    {C1 , C2 , C3 , C4 , . . . , Cn }, with maximum level l, delete
connections {11 → 1, 8 → 1, 8 → 2, 9 → 2, 2 → 11, 3 →               least significant duplicate users.
11, 6 → 8, 12 → 8, 10 → 9, 2 → 5, 5 → 10} (where                       Select a level l and τ to adjust the number of top
Uj → Ui indicates a Retweet/Reply from user Uj to user              clusters. Given an empty bucket E = {} and C =
Ui ); then we would get clusters of the form shown in Fig-          {C1 , C2 , C3 , C4 , . . . , Cn }. The first step is to populate the
ure 1 assuming l = 2 (the root is at level 0, the “base level”).    bucket with most significant user Uk . By doing so the algo-
The figure shows three clusters with respect to users 1, 2 and      rithm reads all the clusters once. Suppose users in clusters
10. The clusters in this case are C1 = {1, 2, 3, 6, 8, 11, 12},     is given by Ui and users in empty bucket is Ue . The empty
C2 = {2, 6, 8, 9, 10, 12} and C10 = {2, 5, 10}. A user ap-          bucket is filled in the following fashion.
pears in a cluster only once; there is no duplicity within a       1. If Ui = Ue and Ui (level) < Ue (level). Replace Ue by Ui .
cluster. Each user is allowed to form a cluster and is called
”root user”.                                                       2. If Ui 6= Ue . Put Ui in E.
                                                                   3. If E ={}. Put Ui in E
             1                              2             10           In the second step, the bucket with most significant users
                                                                    are compared with all the clusters once. The duplicates are
    11                8            8                 9     5        deleted from the clusters in the following manner.
                                                                   1. If Ui = Ue and Ui (level) > Ue (level) and Ui (level) 6= 0.
                                                                       Delete Ui from the cluster.
2        3        6       12   6       12       10         2
                                                                   2. If Ui = Ue and Ui (level) = Ue (level). Set Ue (level) = −1.
         (a) C1                        (b) C2            (c) C3        To make sure that all the duplicates with this condition is
                                                                       deleted except one.
             Figure 1: Example clusters for users
                                                                                   The Proposed Algorithm
   From the above simple example we can see a substan-             In this section the proposed duplicate removal algorithm
tial overlap. Note that each user in each cluster is marked        is presented. Recall, with respect to the forging, that using
with its “level of appearance” (neighbourhood level). Where        some maximum level l we generate a cluster set C describ-
a user has several level associated with it the level nearest      ing social media (Twitter) users. The set C will include one
the root is chosen (the closer to the root the more significant    cluster per user and thus feature numerous overlapping clus-
a user is deemed to be). Thus, given a real Twitter data set,      ters. Only those users who have got even a single retweet or
reply message are selected to create cluster. Others are ig-       Algorithm 2 Bucket with Most Significant Users
nored. Each cluster member (user) is associated with a level       INPUT: A Cluster set C, generated using max level l, and
of appearance. If a user has several levels associated with it     pruned using τ and an empty bucket set E
the level nearest to root (target user) will be used. We have      OUTPUT: A Bucket set E 0 with most significant users Ue
experimented with all the selected users that form clusters.
Yet, we have shown the use of threshold τ . In case we have         1: for each user Ui in cluster Ci do
even larger data, τ can be used. In that case only the clusters     2:     for each user Ue in Bucket do
who’s size (in terms of number of members) as defined by            3:        if Bucket is Empty then
the threshold τ are selected. Most of the clusters are over-        4:            put Ui in E
lapping.                                                            5:        else
   The pseudo for the first algorithm is given in Algorithm 1.      6:            if Ui == Ue and Ui hlevel i < Ue hlevel i
The input is a set of clusters C, generated using some max                then
level l and pruned using the threshold τ . The output is the        7:                Replace Ue by Ui
cluster set C 0 with all duplicates removed.                        8:            else
                                                                    9:                put Ui in E
Algorithm 1 Delete User Without Condition                          10:            end if
INPUT: A Cluster set C, generated using max level l, and           11:        end if
pruned using τ and an empty bucket set                             12:    end for
OUTPUT: The Cluster set C 0 with all duplicates removed            13: end for
                                                                   14: Return E 0
 1: for each user Ui in cluster Ci do
 2:     for each user Ue in Bucket do
 3:        if Bucket is Empty then                                 appearing further away from the root. This, E 0 is the set of
 4:            put Ui in Bucket                                    distinct users with its most significant position. The level of
 5:        else                                                    a user is considered for comparing significance. The clusters
 6:            if Ui == Ue then                                    are traversed only once. Initially the bucket set E is empty.
 7:                Delete user Ui in Cluster Ci                    When the algorithm reads the first user in the first cluster,
 8:            else                                                the bucket is populated. After that, one by one all the users
 9:                Put Ui in Bucket                                in all the clusters are read. The user Ui in clusters is com-
10:            end if                                              pared to Ue of the bucket set E with their level (position
11:        end if                                                  of appearance from the root). The nearest user is the user
12:     end for                                                    which is closer to the root. In the algorithm 2 line 6:9 shows
13: end for                                                        the comparison. The user Ui with the less value i.e nearest
14: Return C 0                                                     to the root replaces user Ue from the bucket. The algorithm
                                                                   continues till all the clusters are read.
   The above algorithm given in Algorithm 1 deletes all the
duplicates in the clusters by populating an empty bucket           Duplicate Removal With Condition
and comparing users in clusters with the bucket users. The         The output from the Algorithm 2 is input for the third al-
bucket size is the total number of distinct users in the cluster   gorithm. Each user Ui in the cluster set is compared to the
set. Here, the bucket uses only the user and not the level of      bucket set E 0 user Ue . All the users those are least signifi-
its appearance. In this algorithm the initial generated clusters   cant and level 6= 0 are deleted from the clusters. If a user
will be bigger in size than the later ones because initially the   is present in both the cluster and the bucket set with same
bucket is empty. Nevertheless all the duplicates are removed       level then the user is kept in at least one cluster. All other
from the cluster set but with a cost. The level of appearance      duplicates are deleted.
of a user is not used and thus the information in the clusters
will be less.
   The next algorithm is the modification of the above Algo-                               Evaluation
rithm 1 which has two parts: These are discussed in further        For the evaluation presented in this section the Geo-tagged
detail in the following two subsections, Sub-sections and .        Microblog data set2 available from the ARK data repository
                                                                   held at the University of Washington was used. The dataset
Generating Most Significant User Bucket                            holds 377616 Tweets covering all US states and the Dis-
This sub-section generates a set of most significant user          trict of Columbia (Eisenstein et al. 2010). The dataset fea-
bucket E 0 . A user Ui is more significant if it appears near      ture 9477 users. From this data set four cluster sets were
to the root than the user further away from the root.              generated using a range of values for l, the maximum dis-
   The above Algorithm 2 generates a set of users E 0 which        tance from the root, {6, 7, 8, 9}. The users who did not get
are most significant in nature. E 0 contains users with its most   any retweet or replied messages from any other user or
significant position given by level. An user appearing clo-
                                                                      2
sure to the root is considered more significant than a user               http://www.ark.cs.cmu.edu/GeoTwitter.
Algorithm 3 Duplicate Removal                                      the comparison of run time of the proposed and modified
INPUT: A Cluster set C, generated using max level l, and           algorithm with the naive algorithm.
pruned using τ and Bucket set E 0                                     Adding further to the evaluation process, τ is used for
OUTPUT:: The set C 0 with most duplicates removed                  different levels {6, 7, 8, 9}. To explore how the threshold τ
                                                                   effects the process we considered setting τ to a range of
 1: for each user Ui in cluster Ci do                              values in terms of the percentage of top clusters to be re-
 2:     for each user Ue in Bucket do                              tained {0.25%, 0.5%, 0.75%, 1.0%, 2.0%} in the cluster set.
 3:        if Ui == Ue and Ui hlevel i > Ue hlevel i then          Table 5 shows the result of different τ values for level 9 only.
 4:            Delete Ui from cluster                              Figure 3 shows the run time of naive, proposed and modified
 5:            if Ui == Ue and Ui hlevel i == Ue hlevel i          algorithm by setting different τ values. In level 9 there are
    then                                                           total 7123 clusters. τ is set such that we get certain percent-
 6:               Set Ue hlevel i == −1                            age of top clusters. Thus in the column ”Number of Clus-
 7:           end if                                               ters”, is the top clusters, each with minimum size greater
 8:        end if                                                  than the numbers mentioned in the column ”Minimum Size
 9:    end for                                                     of Clusters”.
10: end for
11: Return update C 0


self within that time period are not considered for cluster-
ing.This produced cluster sets comprised of 7123 clusters
(|C| = 7123) respectively. Since there are total 9477 users
and the generation of clusters is around 7123, the remaining
2354 are single users with no retweet or reply messages from
anyone or self. 7123 number of clusters also contains single
user cluster like users who have retweeted themselves.These
are 2262 single user clusters out of 7123 clusters. Further, in
7123 clusters total distinct users are 7576 in number. Thus,       Figure 2: Comparison of runtime of naive algorithm with
1901 users neither received any retweet or reply message nor       proposed and modified algorithm for different levels=
they have sent any in the same time period to other users.         6, 7, 8, 9. τ is set to 100%
As l increases the average number of members per cluster
in the four different cluster sets also increases, and conse-
quently the clusters become more diverse but feature greater
numbers of duplicates.
   To analyze the operation of the proposed algorithm we
generated cluster sets with different values for l={6, 7, 8, 9}.
Total number of clusters generated is 7123 and the total
number of distinct users for all levels is 7576. The results
are presented in Table 1, Table 2, Table 3 and Table 4. In
the tables the “Num. of Clusters” column indicates the num-
ber of clusters retained after application of the τ threshold
value. Here τ is set to 100% to generate clusters with min-
imum size of one user. The “Num. of Distinct Users” col-
umn gives the number of distinct users in the retained clus-
ter set. This is also the bucket set generated. The following      Figure 3: Comparison of runtime of naive algorithm with
two columns give the number of duplicates before and after         proposed and modified algorithm at level 9 with different τ
the proposed duplicate removal process was applied, and the        values
last column compares the run time of Naive algorithm with
proposed and modified algorithm. From, the tables it can be
seen that in all cases the proposed algorithm eliminates all
the duplicates featured in each of the cluster sets. To high-                    Analysis and Observation
light the advantages that can be gained using the proposed         The challenge of duplicate removal in large cluster configu-
approach its operation was compared with a naive approach          rations that feature a significant amount of overlap, as in the
where we compare every cluster in the cluster set C with ev-       case of user communities extracted from social media net-
ery other cluster in C and remove all duplicates. Moreover,        works (such as Twitter), is the resource required to remove
the modified algorithm retains certain duplicates and its run-     all duplicates. Furthermore, it becomes more complicated if
time is also compared. Number of duplicates retain in the          selected duplicates are to be retained due to its properties.
modified algorithm is shown in the tables. Figure 2 shows          In the proposed algorithm a cluster set is read only once and
 Algorithms       Num.              Num of         Num. Duplicates      Num. Duplicates       Percentage of       Run Time in
               of Clusters       Distinct Users   Before Elimination    After Elimination   Duplicates Retained    Seconds
   Naive          7123                7576             165185                   0                   0.0            308.593
  Proposed        7123                7576             165185                   0                   0.0             97.593
  Modified        7123                7576             165185                 8638                 5.22            150.766

Table 1: Runtime compared between Naive, Proposed and Modified Algorithms where l = 6 and minimum size of each cluster
is 1, τ = 100%


 Algorithms       Num.              Num of         Num. Duplicates      Num. Duplicates       Percentage of       Run Time in
               of Clusters       Distinct Users   Before Elimination    After Elimination   Duplicates Retained    Seconds
   Naive          7123                7576             246524                   0                   0.0            451.328
  Proposed        7123                7576             246524                   0                   0.0            139.297
  Modified        7123                7576             246524                 9152                 3.71            217.938

Table 2: Runtime compared between Naive, Proposed and Modified Algorithms where l = 7 and minimum size of each cluster
is 1, τ = 100%


 Algorithms       Num.              Num of         Num. Duplicates      Num. Duplicates       Percentage of       Run Time in
               of Clusters       Distinct Users   Before Elimination    After Elimination   Duplicates Retained    Seconds
   Naive          7123                7576             359383                   0                   0.0            672.281
  Proposed        7123                7576             359383                   0                   0.0            202.313
  Modified        7123                7576             359383                 9588                 2.66            310.383

Table 3: Runtime compared between Naive, Proposed and Modified Algorithms where l = 8 and minimum size of each cluster
is 1, τ = 100%


 Algorithms       Num.              Num of         Num. Duplicates      Num. Duplicates       Percentage of       Run Time in
               of Clusters       Distinct Users   Before Elimination    After Elimination   Duplicates Retained    Seconds
   Naive          7123                7576             510768                   0                   0.0            1019.141
  Proposed        7123                7576             510768                   0                   0.0             302.985
  Modified        7123                7576             510768                 9898                 1.93             453.328

Table 4: Runtime compared between Naive, Proposed and Modified Algorithms where l = 9 and minimum size of each cluster
is 1, τ = 100%


 Number      Minimum                  Num of      Number of       Number of          Number of        RunTime     RunTime    RunTime
    of        Size of        τ        Distinct    Duplicates    Duplicates After   Duplicates After      in           in         in
 Clusters     clusters                 Users         before      Elimination        Elimination       Seconds     Seconds    Seconds
                                                  Elimination   Naive, Proposed      (Modified)        Naive      Proposed   Modified
    18         750       0.25%         1687          12847            0,0                719           2.938        2.015      2.578
    36         700       0.5%          1846          25751            0,0               1359           6.594        4.484      5.172
    71         661       1.0%          1900          49434            0,0               1504           15.063       8.281     10.109
   142         560       2.0%          2006          92265            0,0               3790            33.0       17.297     20.140
   283         445       4.0%          2196         162289            0,0               3496           75.563      34.000     35.828
   355         400       5.0%          2263         192506            0,0               3849           97.422      38.672     46.547
   427         371       6.0%          2336         220116            0,0               3967          119.328      45.265     49.859
   499         342       7.0%          2422         245718            0,0               4186          144.797      49.875     59.390
   571         312       8.0%          2472         269115            0,0               4292           171.89      56.907     67.344
   714         268       10.0%         2562         310420            0,0               4203          203.563      67.406     78.719

Table 5: Runtime compared between Naive, Proposed and Modified Algorithms where l = 9 and minimum size of each cluster
is set using τ
                      Level    Cluster Size         Cluster Size            Cluster Size          Cluster Size
                               Less Than 50      Between 50 and 100     Between 100 and 200       Above 200
                        6          7108                 11                      04                     0
                        7          7103                 13                      06                    01
                        8          7098                 15                      07                    03
                        9          7094                 19                      07                    03

                                               Table 6: Cluster Size After Elimination


all the duplicates are removed. The proposed algorithm does           ters represent active users in the cluster configuration if we
not take care of the appearance of a user or member by level.         do not consider the clusters with one member only.
Since, a cluster is generated for each user starting from the            In the future, there are few inroads that open up with this
root, it is better to keep the root of the cluster. Moreover, our     study. The bucket set can be converted into knowledge set of
intuition say that certain duplicates will enrich the clusters.       individual distinct users which can learn by reading differ-
To accomplish this, we modified the proposed algorithm to             ent clusters. Thus, the properties of users will be enhanced
keep selected duplicates. A user closer to the root user is           and also the clusters. The use of τ can be used for much big-
more similar to it. The modified algorithm reads the clus-            ger data set for close approximation. Another future work of
ter set two times. First, the algorithm reads the cluster set to      the study is to investigate how physical distance matters be-
generate a set of users by level and in the second pass, all the      tween users i.e users who are retweeting the post of another
clusters are read and compared to the set of users by selec-          user. Furthermore, a more detailed study of most prominent
tion conditions. This set consists of most significant distinct       clusters in the cluster set will open up new avenues.
user. Those members fail the conditions are deleted. From
Table 1, Table 2, Table 3 and Table 4 it is observed that as the                         Acknowledgement
level l increases the number of duplicates also increases but
the percentage of duplicates deleted decreases drastically as         This research work is partially supported by the Visves-
shown in column ”Percentage of Duplicates Retained”. For              varaya PhD scheme, Ministry of Electronics and Informa-
level 6 the percentage of duplicates retained is 5.22% where          tion Technology, Government of India.
as for the level 9 the figure is 1.93%.
   In the Table 5, top 10% of the clusters gives us 33% of                                   References
total users in the cluster set which is around 7576. Moreover,        Adedoyin-Olowe, M.; Gaber, M. M.; and Stahl, F. 2014. A
after duplicate removal majority of clusters left is of size less     survey of data mining techniques for social media analysis.
than 50. For example, at level 6, out of 7123 clusters only           Journal of Data Mining & Digital Humanities 2014.
15 clusters are of size more than 50. If we bifurcate further
then in this only 4 clusters are there which go beyond 100            Arora, S.; Ge, R.; Sachdeva, S.; and Schoenebeck, G. 2012.
in size. This is listed in the Table 6 for other levels. Thus,        Finding overlapping communities in social networks: To-
the root users in the big clusters are those who got retweet          ward a rigorous approach. In Proceedings of the 13th ACM
and reply messages from maximum other users directly or               Conference on Electronic Commerce, EC ’12, 37–54. New
through chain of other users. We can see these users as the           York, NY, USA: ACM.
most prominent users in the cluster set. In general, an in-           Barbier, G., and Liu, H. 2011. Data mining in social media.
fluencer spreads message to maximum members. But in our               In Social network data analytics. Springer. 327–352.
case, the root users are those who got maximum retweets               Bild, D. R.; Liu, Y.; Dick, R. P.; Mao, Z. M.; and Wallach,
or reply messages. Thus “crisp” clusters are generated for a          D. S. 2015. Aggregate characterization of user behavior
cluster set where overlapping is minimized.                           in twitter and analysis of the retweet graph. ACM Trans.
                                                                      Internet Technol. 15(1):4:1–4:24.
                        Conclusion
                                                                      Brzozowski, M. J., and Romero, D. M. 2011. Who should
Here, we demonstrated methods to eliminate duplicates                 i follow? recommending people in directed social networks.
among numerous overlapping clusters formed using retweet              ICWSM.
and reply message links among users using Twitter data.
The retweet and reply network among users are highly over-            Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K.
lapping. The overlapping clusters fade dissimilarity. To get          2010. Measuring user influence in twitter: The million fol-
some good clusters or dissimilar clusters to lessen the sim-          lower fallacy. In 4th International AAAI Conference on We-
ilarity, elimination of duplicates is necessitated. Our meth-         blogs and Social Media (ICWSM).
ods work much better than the naive algorithm. Moreover,              Chandra, S.; Khan, L.; and Muhaya, F. B. 2011. Estimat-
using this method we can selectively delete duplicates much           ing twitter user location using social interactions–a content
faster. This study also shows the generation of ”crisp” clus-         based approach. In 2011 IEEE Third International Confer-
ters and among them a few prominent clusters, that is, the            ence on Privacy, Security, Risk and Trust and 2011 IEEE
clusters which are much bigger than other clusters in the set         Third International Conference on Social Computing, 838–
after duplicate elimination. The retweet/reply network clus-          843.
Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where         Knowledge Discovery and Data Mining, KDD ’14, 1366–
you tweet: A content-based approach to geo-locating twitter      1375. New York, NY, USA: ACM.
users. In Proceedings of the 19th ACM International Con-         Kouloumpis, E.; Wilson, T.; and Moore, J. 2011. Twit-
ference on Information and Knowledge Management, num-            ter sentiment analysis: The good the bad and the omg! In
ber 10 in CIKM ’10, 759–768. New York, NY, USA: ACM.             ICWSM.
Clauset, A.; Newman, M. E.; and Moore, C. 2004. Finding          Kryvasheyeu, Y.; Chen, H.; Obradovich, N.; Moro, E.;
community structure in very large networks. Physical review      Van Hentenryck, P.; Fowler, J.; and Cebrian, M. 2016. Rapid
E 70(6):066111.                                                  assessment of disaster damage using social media activity.
Conover, M.; Ratkiewicz, J.; Francisco, M.; Goncalves, B.;       Science Advances 2(3).
Menczer, F.; and Flammini, A. 2011. Political polarization       Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is
on twitter. AAAI.                                                twitter, a social network or a news media? In Proceedings
Dreier, J.; Kuinke, P.; Przybylski, R.; Reidl, F.; Rossmanith,   of the 19th International Conference on World Wide Web,
P.; and Sikdar, S. 2014. Overlapping communities in social       WWW ’10, 591–600. New York, NY, USA: ACM.
networks. CoRR abs/1412.4973.                                    Lancichinetti, A., and Fortunato, S. 2009. Benchmarks
Duan, L.; Street, W. N.; Liu, Y.; and Lu, H. 2014. Com-          for testing community detection algorithms on directed and
munity detection in graphs through correlation. In Proceed-      weighted graphs with overlapping communities. Phys. Rev.
ings of the 20th ACM SIGKDD International Conference on          E 80(1):016118.
Knowledge Discovery and Data Mining, KDD ’14, 1376–              Lee, C.; Reid, F.; McDaid, A.; and Hurley, N. 2010. De-
1385. New York, NY, USA: ACM.                                    tecting highly overlapping community structure by greedy
                                                                 clique expansion. ArXiv e-prints.
Eisenstein, J.; O’Connor, B.; Smith, N. A.; and Xing, E. P.
2010. A latent variable model for geographic lexical vari-       Li, N., and Wu, D. D. 2010. Using text mining and sentiment
ation. In Proceedings of the 2010 Conference on EMNLP,           analysis for online forums hotspot detection and forecast.
1277–1287. Association for Computational Linguistics.            Decis. Support Syst. 48(2):354–368.
Gloor, P. A.; Krauss, J.; Nann, S.; Fischbach, K.; and           Lussier, J. T., and Chawla, N. V. 2011. Network effects on
Schoder, D. 2009. Web science 2.0: Identifying trends            tweeting. In Proceedings of the 14th International Confer-
through semantic social network analysis. In 2009 Interna-       ence on Discovery Science, DS’11, 209–220. Berlin, Hei-
tional Conference on Computational Science and Engineer-         delberg: Springer-Verlag.
ing, volume 4, 215–222.                                          Mishra, N.; Schreiber, R.; Stanton, I.; and Tarjan, R. E. 2007.
                                                                 Clustering social networks. In Proceedings of the 5th In-
Goldberg, M.; Kelley, S.; Magdon-Ismail, M.; Mertsalov,          ternational Conference on Algorithms and Models for the
K.; and Wallace, A. 2010. Finding overlapping communi-           Web-graph, WAW’07, 56–67. Berlin, Heidelberg: Springer-
ties in social networks. In 2010 IEEE Second International       Verlag.
Conference on Social Computing, 104–113.
                                                                 Naaman, M.; Boase, J.; and Lai, C.-H. 2010. Is it really
Gregory, S. 2008. A fast algorithm to find overlapping com-      about me?: Message content in social awareness streams.
munities in networks. In Machine learning and knowledge          In Proceedings of the 2010 ACM Conference on Computer
discovery in databases. Springer. 408–423.                       Supported Cooperative Work, CSCW ’10, 189–192. New
Hecht, B.; Hong, L.; Suh, B.; and Chi, E. H. 2011. Tweets        York, NY, USA: ACM.
from justin bieber’s heart: The dynamics of the location field   Paul, A.; Dutta, A.; and Coenen, F. 2016. Cluster of tweet
in user profiles. In Proceedings of the SIGCHI Conference        users based on optimal set. In 2016 IEEE Region 10 Con-
on Human Factors in Computing Systems, CHI ’11, 237–             ference (TENCON), 286–290.
246. New York, NY, USA: ACM.                                     Shiokawa, H.; Fujiwara, Y.; and Onizuka, M. 2013. Fast
Hou, Y.; Whang, J. J.; Gleich, D. F.; and Dhillon, I. S.         algorithm for modularity-based graph clustering. In AAAI,
2015. Non-exhaustive, overlapping clustering via low-rank        1170–1176.
semidefinite programming. In Proceedings of the 21th ACM         Srivastava, J. 2008. Data mining for social network analysis.
SIGKDD International Conference on Knowledge Discov-             In 2008 IEEE International Conference on Intelligence and
ery and Data Mining, KDD ’15, 427–436. New York, NY,             Security Informatics, xxxiii–xxxiv.
USA: ACM.
                                                                 Whang, J. J.; Gleich, D. F.; and Dhillon, I. S. 2016. Overlap-
Jensen, D., and Neville, J. 2003. Data mining in social          ping community detection using neighborhood-inflated seed
networks. 287–302. In Dynamic Social Network Modeling            expansion. IEEE Transactions on Knowledge and Data En-
and Analysis: Workshop Summary and Papers.                       gineering 28(5):1272–1284.
Kiss, C., and Bichler, M. 2008. Identification of influencers    Wu, S.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011.
- measuring influence in customer networks. Decis. Support       Who says what to whom on twitter. WWW ’11, 705–714.
Syst. 46(1):233–253.                                             New York, NY, USA: ACM.
Kloumann, I. M., and Kleinberg, J. M. 2014. Community            Zhang, J., and Yu, P. S. 2015. Community detection for
membership identification from small seed sets. In Proceed-      emerging networks. In Proceedings of the 2015 SIAM Inter-
ings of the 20th ACM SIGKDD International Conference on          national Conference on Data Mining, 127–135. SIAM.