<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Duplicate Removal for Overlapping Clusters: A Study Using Social Media Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amit Paul</string-name>
          <email>amitpaul06@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Animesh Dutta</string-name>
          <email>animeshnit@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber</institution>
          ,
          <addr-line>D. Lenat, F. van Harmelen, P. Clark (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Engineering</institution>
          ,
          <addr-line>NIT Durgapur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media is a labyrinth of information which, when uncovered, provides deep insight into real-world happenings. In this study, we use the social media platform Twitter to create user groups or clusters using the retweet and reply directed links. The main idea behind creating the groups is to find the place best suited to each user and to generate crisp clusters. Each user forms a group, and thus numerous overlapping groups or clusters are created. To obtain crisp clusters, we present an algorithm for removing duplicates in cluster configurations that feature a significant amount of overlap. The idea is to consider numerous overlapping clusters in a cluster set and to compare each cluster with a set of users that is itself built from these clusters. The proposed algorithm deletes all duplicates and is compared to a naive algorithm. Moreover, a modified algorithm is proposed in which selected duplicates are kept, based on the most significant position of the user among all clusters in the configuration. This does not guarantee that all duplicates will be removed but, as shown in the study, a majority of duplicates are removed. Both the proposed and the modified algorithm are much faster than the naive one. This domain was selected because it is one where we wish to identify unique user communities (clusters) and where a large amount of overlap typically exists. After duplicate elimination, we are left with a few clusters that are much bigger than the other clusters in the cluster set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social networking sites generate extensive amounts of data.
It is generally acknowledged that embedded within this
data there is a lot of useful, domain dependent,
knowledge
        <xref ref-type="bibr" rid="ref1 ref25">(Adedoyin-Olowe, Gaber, and Stahl 2014)</xref>
        . The
challenge is to identify and extract this knowledge in a
manner whereby it can be meaningfully utilized. One popular
mechanism for attempting to do this is to use data
mining technology
        <xref ref-type="bibr" rid="ref23 ref35 ref42 ref5 ref9">(Srivastava 2008; Jensen and Neville 2003;
Barbier and Liu 2011)</xref>
        . Examples of where data mining
technology has been applied to social network data
include: content analysis
        <xref ref-type="bibr" rid="ref33 ref38 ref45">(Naaman, Boase, and Lai 2010; Wu
et al. 2011)</xref>
        , identification of influencers
        <xref ref-type="bibr" rid="ref24 ref43">(Cha et al. 2010;
Kiss and Bichler 2008)</xref>
        , identification of communities
        <xref ref-type="bibr" rid="ref16 ref20 ref29 ref32 ref36 ref44 ref47">(Lee
et al. 2010; Mishra et al. 2007; Zhang and Yu 2015; Duan
et al. 2014; Gregory 2008; Whang, Gleich, and Dhillon
2016)</xref>
        , determination of the geographic location of users
(by message contents)
        <xref ref-type="bibr" rid="ref11 ref12 ref29 ref32 ref33 ref35 ref5 ref9">(Cheng, Caverlee, and Lee 2010;
Chandra, Khan, and Muhaya 2011)</xref>
        and using user location
in profile
        <xref ref-type="bibr" rid="ref21">(Hecht et al. 2011)</xref>
        , sentiment analysis and
opinion mining
        <xref ref-type="bibr" rid="ref26 ref35 ref5 ref9">(Kouloumpis, Wilson, and Moore 2011)</xref>
        ,
determining who is “following” / “friends with” / “connected to”
whom
        <xref ref-type="bibr" rid="ref29 ref35 ref5 ref9">(Brzozowski and Romero 2011; Kwak et al. 2010)</xref>
        ,
trend identification
        <xref ref-type="bibr" rid="ref18">(Gloor et al. 2009)</xref>
        , and “hot spot”
detection
        <xref ref-type="bibr" rid="ref33">(Li and Wu 2010)</xref>
        (indicating some natural disaster)
        <xref ref-type="bibr" rid="ref27">(Kryvasheyeu et al. 2016)</xref>
        .
      </p>
      <p>
        Generally, there are three ways to analyse Twitter data: the
social network analysis, content analysis and context
analysis. Many works have been carried out using message
content while valuable retweet information is neglected
        <xref ref-type="bibr" rid="ref7">(Bild et
al. 2015)</xref>
        . In this paper, we consider retweet and
reply directed links to identify user groupings or clusters. A
retweet is a message forwarded by a user to his or her
followers. This is of interest because it tells us who is connected to
whom, or in Twitter jargon who is “following” whom.
Moreover, a user in the Twitter network can retweet any other
user’s tweet, and this shows the topical interest of the user
who retweets. This allows us to
group (cluster) users according to whom they are
“following”, which in turn is of interest with respect to a variety of
socio-economic applications such as recommending
followers or recommending feeds for tweeting. However, unlike
in the case of conventional clustering algorithms, grouping
users in this way typically results in numerous overlapping
clusters (groups of users). Individual Twitter users typically
follow many others, and are typically followed by many
others. On average a Twitter user has 208 followers, although
the variance is considerable (Twitter statistics and facts, August 2016,
http://expandedramblings.com/index.php/). Since a user may be following
numerous other users, he or she may belong to different
communities, hence the overlap. Furthermore, Twitter does not
require a user to be a follower of someone to retweet their
content, and this also increases the chance of
overlap, since a single user can retweet many tweets of other
users and vice versa.
      </p>
      <p>Overlapping clusters (user groupings) may not always
be a bad thing; but for many applications, for example
social media user segmentation, we wish to identify “crisp”
clusters, clusters that have a unique membership. More
generally, overlapping clusters are undesirable in that they
“fade” the dissimilarity (distinctiveness) between clusters.
The greater the cluster overlap, the more similar the
clusters become, and the differentiation between clusters
deteriorates. The problem is exacerbated when we have, not two
or three overlapping clusters, but many hundreds with
varying degrees of overlap (similarity) as in the case of Twitter
communities.</p>
      <p>To derive “crisp” clusters from a set of clusters where one
or more of the clusters overlap, it is necessary to remove
duplicate members from individual clusters, using some
criteria, so that each cluster becomes unique; a process known
as duplicate removal. In this paper, we propose an
algorithm to remove all duplicates from clusters. However, by
doing so we may be losing important information. Ideally,
duplicate removal should be conducted in such a way that
information is not lost, or at least the loss is minimized. In
the case where we have many overlapping clusters there is
also a computational overhead involved; thus we wish our
duplicate removal to be conducted in such a way that the
number of comparisons that need to be made is minimized.
We therefore also propose a second, simple algorithm
for the effective derivation of crisp clusters from
overlapping clusters derived from Twitter data using the medium of
retweets. In doing so, we place each user in the group that
suits it best by hierarchy, using the retweet/reply links.</p>
      <p>With respect to the work presented in this paper we
conceptualize Twitter data in terms of a directed graph where
the vertices represent users and the edges represent retweets or replies
from one user to another. Generally, it is assumed that a
user “retweets” another user if there is something interesting
(topical) in a received tweet. Clusters representing
communities can then be generated starting with an individual
“target” user (a vertex in the graph) and proceeding in a breadth-first
manner, level by level, up to some pre-specified maximum
level (distance from start) l. At each level the vertices are
added to the cluster representing the target user. In this
manner a set of clusters, a cluster configuration, can be
produced; one cluster for each target user in a given set of tweets.
However, the resulting set of clusters will feature significant
overlap, which makes interpretation difficult (as discussed
above). Note that clustering users using retweets and replies
is different from using Follow links; Follow links are
historical in nature, whilst retweet and reply links are current.
Hence clusters generated using retweet and reply links tend
to be much more current (topical) than clusters generated
using Follow links.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Distinguishing overlapping clusters is difficult because the
clusters are highly similar. Our work on overlapping
clusters is based on the retweet or reply network
        <xref ref-type="bibr" rid="ref35 ref40 ref5 ref9">(Paul, Dutta,
and Coenen 2016; Lussier and Chawla 2011)</xref>
        of the social
media platform Twitter. In our case, the majority of duplicates are removed
to obtain unique groups or communities in which overlap is
minimized. Our problem is exact duplicate removal. In
social media a user has followers and friends. Tweets
generally flow from a user to the followers and friends. The social
followers graph and other communities using followers and
friends are well studied. However, the retweet network, in which there
is a directed edge between two users from source to
destination, has received little attention
        <xref ref-type="bibr" rid="ref7">(Bild et al. 2015)</xref>
        .
Size, noise and dynamism are dominant research issues with
social media
        <xref ref-type="bibr" rid="ref1 ref25">(Adedoyin-Olowe, Gaber, and Stahl 2014)</xref>
        . A
user may be present in several social groups or
communities, which produces overlapping clusters.
      </p>
      <p>
        Many works have been carried out to detect community
clusters in social media
        <xref ref-type="bibr" rid="ref15 ref16 ref19 ref20 ref29 ref3 ref30 ref32 ref36 ref44 ref47">(Lee et al. 2010; Mishra et al. 2007;
Zhang and Yu 2015; Duan et al. 2014; Gregory 2008;
Whang, Gleich, and Dhillon 2016; Goldberg et al. 2010;
Arora et al. 2012; Hou et al. 2015; Dreier et al. 2014;
Lancichinetti and Fortunato 2009)</xref>
        . Social networking
communities are highly overlapped as a node is present in more
than one community. The benchmark algorithms to detect
communities work better when overlapping is minimized
        <xref ref-type="bibr" rid="ref29 ref32">(Lee et al. 2010)</xref>
        . In the paper
        <xref ref-type="bibr" rid="ref47">(Zhang and Yu 2015)</xref>
        the
authors detect communities in emerging networks using a
closeness measure, “intimacy”. In our case, we
cluster nodes through retweet or reply links. After duplicate
removal we obtain some unique cluster communities. They are unique in
the sense that they do not follow the complete
community definition
        <xref ref-type="bibr" rid="ref3">(Arora et al. 2012)</xref>
        in the social network. In
the paper
        <xref ref-type="bibr" rid="ref16">(Duan et al. 2014)</xref>
        , the authors use correlation
analysis to connect to modularity-based methods
        <xref ref-type="bibr" rid="ref13 ref41">(Shiokawa,
Fujiwara, and Onizuka 2013; Clauset, Newman, and Moore
2004)</xref>
        for community detection.
      </p>
      <p>
        Although there are a number of works using seed
expansion
        <xref ref-type="bibr" rid="ref29 ref32 ref44">(Lee et al. 2010; Whang, Gleich, and Dhillon 2016)</xref>
        for
detecting overlapping communities, there is no clear
understanding of which technique is most suitable for a
particular domain
        <xref ref-type="bibr" rid="ref25">(Kloumann and Kleinberg 2014)</xref>
        and the
performance of community assignment algorithms
        <xref ref-type="bibr" rid="ref10 ref17">(Lee et al. 2010)</xref>
        . The paper (Lee et al. 2010) introduced a greedy
clique expansion algorithm that removes near-duplicate
communities using distinct cliques as seeds. In
        <xref ref-type="bibr" rid="ref14">(Conover et al.
2011)</xref>
        the authors use the network of retweets and the
mention network to find political alignment. Cluster analysis of
these networks reveals clear segregation. Our approach
focuses on exact duplicate removal in overlapping clusters to
obtain “crisp” cluster communities by finding a suitable
position of a user in the group.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Scope of The Work</title>
      <p>The work presented in this paper is directed at deriving crisp
clusters from overlapping clusters by finding a user’s best
suited position. How many clusters overlap depends
on the level of hierarchy: both the number of
overlapping clusters and the similarity between the
clusters increase as the level goes up. At each level, clusters
of different sizes are selected using a threshold. The problem
addressed here is the removal or elimination of duplicates among
overlapping clusters. The first algorithm deletes all duplicate
users among the clusters. It gradually builds a
set of unique users by comparing each user in this set with
the users in the clusters, removing a user
from a cluster whenever it is already in the set. The second algorithm is a
modification of the first in which selected duplicates that meet
certain criteria are not removed, since removing all
duplicates eventually means loss of information. Both
algorithms are compared with the naive algorithm and show much
improved time complexity.</p>
    </sec>
    <sec id="sec-4">
      <title>Problem Formulation</title>
      <p>As noted above, the overlapping clusters of interest, with
respect to the work presented in this paper, are clusters of
Twitter users. The clusters are formed using retweet and reply
links between users. The links are traversed in a breadth-first
search manner; sideways links within the same layer or level
are not taken. We are given a retweet graph G = {V, E}, where V
is the set of vertices (user nodes) and E is the set of directed edges. A
user U<sub>i</sub> is connected to another user U<sub>j</sub> if U<sub>j</sub> has retweeted
or replied to user U<sub>i</sub>; note that the relationship is
unidirectional (as opposed to bidirectional), so there is an edge
from U<sub>j</sub> to U<sub>i</sub>. Thus, starting from a given user we can
place this user and all its immediate neighbours into a
single cluster (where a neighbouring user is one connected
directly to the current user by a retweet or reply). If two users
are “connected” they are in the same cluster. We can then
proceed to the users one level further from the seed user,
and so on up to some predefined maximum “level” l. If we
assume a set of m Twitter users U = {U<sub>1</sub>, U<sub>2</sub>, U<sub>3</sub>, …, U<sub>m</sub>} =
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} and the following set of
connections {11 → 1, 8 → 1, 8 → 2, 9 → 2, 2 → 11, 3 → 11,
6 → 8, 12 → 8, 10 → 9, 2 → 5, 5 → 10} (where
U<sub>j</sub> → U<sub>i</sub> indicates a retweet/reply from user U<sub>j</sub> to user
U<sub>i</sub>), then we would get clusters of the form shown in
Figure 1, assuming l = 2 (the root is at level 0, the “base level”).
The figure shows three clusters with respect to users 1, 2 and
10. The clusters in this case are C<sub>1</sub> = {1, 2, 3, 6, 8, 11, 12},
C<sub>2</sub> = {2, 6, 8, 9, 10, 12} and C<sub>10</sub> = {2, 5, 10}. A user
appears in a cluster only once; there is no duplication within a
cluster. Each user is allowed to form a cluster and is called
the “root user” of that cluster.
With a real Twitter data set we can expect extensive overlap. For the purpose of the
proposed algorithm, all the clusters in the cluster set are
used. Moreover, experiments are also performed by
selecting the largest clusters (in terms of number of members),
defined using a size threshold. For a given maximum level l, the
threshold is adjusted so that it yields the top 0.25%, 0.5%, 1.0%,
2.0%, 4.0%, etc. of the total number of clusters. The
selected threshold value thus dictates a minimum cluster size
below which clusters are not considered for duplicate removal.
The clusters are collections of “users” or “members”; we
use these words interchangeably.</p>
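      <p>The cluster formation described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study: the function name build_clusters and the edge-list input format are assumptions, and the sketch forms a cluster for every user that appears in at least one link. Each cluster is stored as a map from member to level of appearance, with the first (nearest) level recorded.</p>
      <preformat>
```python
from collections import defaultdict, deque


def build_clusters(edges, max_level):
    """Build one cluster per user by breadth-first search over incoming links.

    edges: iterable of (source, target) pairs meaning "source retweeted or
    replied to target". Returns {root: {member: level}}.
    """
    incoming = defaultdict(list)  # target -> users who retweeted/replied to it
    users = set()
    for source, target in edges:
        incoming[target].append(source)
        users.update((source, target))

    clusters = {}
    for root in sorted(users):
        level = {root: 0}  # first visit records the level nearest the root
        queue = deque([root])
        while queue:
            user = queue.popleft()
            if level[user] == max_level:
                continue  # do not expand beyond the maximum level
            for neighbour in incoming.get(user, []):
                if neighbour not in level:
                    level[neighbour] = level[user] + 1
                    queue.append(neighbour)
        clusters[root] = level
    return clusters


# The connections of the worked example, with l = 2.
EDGES = [(11, 1), (8, 1), (8, 2), (9, 2), (2, 11), (3, 11),
         (6, 8), (12, 8), (10, 9), (2, 5), (5, 10)]
clusters = build_clusters(EDGES, max_level=2)
print(sorted(clusters[1]))   # members of C1
print(sorted(clusters[2]))   # members of C2
print(sorted(clusters[10]))  # members of C10
```
      </preformat>
      <p>For the example connections, the three printed member lists coincide with the clusters C<sub>1</sub>, C<sub>2</sub> and C<sub>10</sub> given in the text.</p>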
      <p>Problem: Given a set of n overlapping clusters C =
{C<sub>1</sub>, C<sub>2</sub>, C<sub>3</sub>, …, C<sub>n</sub>}, with maximum level l, delete all
duplicates in the clusters.</p>
      <p>In this paper we propose using an “empty bucket” cluster
together with the set of n overlapping clusters. Initially the “empty bucket”
is empty. Starting from the first cluster in the set, the
members are compared to the “empty bucket”, which is gradually
populated by the members of the clusters. The common
or duplicate users are deleted from the clusters and the
non-duplicate members are added to the “empty bucket”. The
“empty bucket” will therefore contain only unique members.</p>
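      <p>Given C = {C<sub>1</sub>, C<sub>2</sub>, C<sub>3</sub>, C<sub>4</sub>, …, C<sub>n</sub>} and E = {}, where
C<sub>i</sub> = {U<sub>1</sub>, U<sub>2</sub>, U<sub>3</sub>, U<sub>4</sub>, …, U<sub>m</sub>}: if C<sub>i</sub> ∩ E = C<sub>s</sub>, the
duplicates in C<sub>s</sub> are deleted from the cluster. If C<sub>i</sub> − E = C<sub>u</sub>, the
uncommon members C<sub>u</sub> are added to the “empty bucket”.
If C<sub>i</sub> ∩ E = ∅, the cluster members are added to the empty
bucket.</p>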
      <p>Problem: Given a set of n overlapping clusters C =
{C<sub>1</sub>, C<sub>2</sub>, C<sub>3</sub>, C<sub>4</sub>, …, C<sub>n</sub>}, with maximum level l, delete the
least significant duplicate users.</p>
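      <p>Select a level l and a threshold to adjust the number of top
clusters. Given an empty bucket E = {} and C =
{C<sub>1</sub>, C<sub>2</sub>, C<sub>3</sub>, C<sub>4</sub>, …, C<sub>n</sub>}, the first step is to populate the
bucket with the most significant users U<sub>k</sub>. In doing so, the
algorithm reads all the clusters once. Let a user in the clusters
be denoted by U<sub>i</sub> and a user in the empty bucket by U<sub>e</sub>. The empty
bucket is filled in the following fashion.</p>
      <list list-type="order">
        <list-item><p>If U<sub>i</sub> = U<sub>e</sub> and U<sub>i</sub>(level) &lt; U<sub>e</sub>(level), replace U<sub>e</sub> by U<sub>i</sub>.</p></list-item>
        <list-item><p>If U<sub>i</sub> ≠ U<sub>e</sub>, put U<sub>i</sub> in E.</p></list-item>
        <list-item><p>If E = {}, put U<sub>i</sub> in E.</p></list-item>
      </list>
      <p>In the second step, the bucket of most significant users
is compared with all the clusters once. The duplicates are
deleted from the clusters in the following manner.</p>
      <list list-type="order">
        <list-item><p>If U<sub>i</sub> = U<sub>e</sub> and U<sub>i</sub>(level) &gt; U<sub>e</sub>(level) and U<sub>i</sub>(level) ≠ 0, delete U<sub>i</sub> from the cluster.</p></list-item>
        <list-item><p>If U<sub>i</sub> = U<sub>e</sub> and U<sub>i</sub>(level) = U<sub>e</sub>(level), set U<sub>e</sub>(level) = 1. This is to make sure that all the duplicates meeting this condition are deleted except one.</p></list-item>
      </list>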
    </sec>
    <sec id="sec-5">
      <title>The Proposed Algorithm</title>
      <p>In this section the proposed duplicate removal algorithm
is presented. Recall, with respect to the foregoing, that using
some maximum level l we generate a cluster set C
describing social media (Twitter) users. The set C will include one
cluster per user and thus feature numerous overlapping
clusters. Only those users who have received at least one retweet or
reply message are selected to create clusters; others are
ignored. Each cluster member (user) is associated with a level
of appearance. If a user has several levels associated with it,
the level nearest to the root (target user) is used. We have
experimented with all the selected users that form clusters,
but we also show the use of a size threshold. With
even larger data the threshold can be used so that only the clusters
whose size (in terms of number of members) meets
the threshold are selected. Most of the clusters are
overlapping.</p>
      <p>Figure 1: Example clusters for l = 2. Panels: (a) C1, (b) C2, (c) C3.</p>
      <p>From the simple example in Figure 1 we can see a
substantial overlap. Note that each user in each cluster is marked
with its “level of appearance” (neighbourhood level). Where
a user has several levels associated with it, the level nearest
the root is chosen (the closer to the root, the more significant
a user is deemed to be).</p>
      <p>The pseudocode for the first algorithm is given in Algorithm 1.
The input is a set of clusters C, generated using some maximum
level l and pruned using the threshold. The output is the
cluster set C′ with all duplicates removed.</p>
      <sec id="sec-5-1">
        <title>Algorithm 1 Delete User Without Condition</title>
        <preformat>INPUT: A cluster set C, generated using max level l and pruned using the threshold, and an empty bucket set
OUTPUT: The cluster set C′ with all duplicates removed
for each user Ui in cluster Ci do
  for each user Ue in Bucket do
    if Bucket is Empty then
      put Ui in Bucket
    else
      if Ui == Ue then
        Delete user Ui in Cluster Ci
      else
        Put Ui in Bucket
      end if
    end if
  end for
end for
Return C′</preformat>
        <p>Algorithm 1 deletes all the
duplicates in the clusters by populating an empty bucket
and comparing the users in the clusters with the bucket users. The
final bucket size is the total number of distinct users in the cluster
set. Here, the bucket uses only the user and not the level of
its appearance. In this algorithm the clusters processed first
remain bigger than the later ones, because initially the
bucket is empty. All the duplicates are removed
from the cluster set, but at a cost: the level of appearance
of a user is not used, and thus the clusters retain
less information.</p>
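        <p>A compact way to express the same idea is sketched below in Python. It is an illustration under stated assumptions, not the code used in the study: the function name remove_all_duplicates is hypothetical, and a hash set stands in for the bucket so that each membership check replaces the inner loop of the pseudocode. It relies on the property, stated earlier, that a user appears at most once within a cluster.</p>
        <preformat>
```python
def remove_all_duplicates(clusters):
    """Keep each user only in the first cluster in which it is encountered."""
    bucket = set()   # the "empty bucket": users seen so far, initially empty
    crisp = []
    for cluster in clusters:
        kept = [user for user in cluster if user not in bucket]
        bucket.update(kept)   # non-duplicate members populate the bucket
        crisp.append(kept)    # duplicates are dropped from this cluster
    return crisp, bucket


# Clusters of the worked example (C1, C2, C10).
C1 = [1, 2, 3, 6, 8, 11, 12]
C2 = [2, 6, 8, 9, 10, 12]
C10 = [2, 5, 10]
crisp, bucket = remove_all_duplicates([C1, C2, C10])
print(crisp)           # C1 is unchanged, later clusters shrink
print(sorted(bucket))  # the distinct users of the cluster set
```
        </preformat>
        <p>On the example clusters the first cluster keeps all its members while the later ones shrink, which illustrates why clusters processed early end up larger.</p>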
        <p>The next algorithm is a modification of
Algorithm 1 and has two parts, which are discussed in further
detail in the following two sub-sections.</p>
        <sec id="sec-5-1-1">
          <title>Generating Most Significant User Bucket</title>
          <p>This sub-section describes how the bucket E′ of most
significant users is generated. A user U<sub>i</sub> is more significant when it appears nearer
to the root than when it appears further away from the root.</p>
          <p>Algorithm 2 generates the set E′ of users that
are most significant in nature. E′ contains each user together with its most
significant position, given by its level. A user appearing
closer to the root is considered more significant than a user appearing further away from the root.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Algorithm 2 Bucket with Most Significant Users</title>
        <preformat>INPUT: A cluster set C, generated using max level l and pruned using the threshold, and an empty bucket set E
OUTPUT: A bucket set E′ with most significant users Ue
1: for each user Ui in cluster Ci do
2:   for each user Ue in Bucket do
3:     if Bucket is Empty then
4:       put Ui in E
5:     else
6:       if Ui == Ue and Ui⟨level⟩ &lt; Ue⟨level⟩ then
7:         Replace Ue by Ui
8:       else
9:         put Ui in E
10:      end if
11:    end if
12:  end for
13: end for
14: Return E′</preformat>
      </sec>
      <sec id="sec-5-3">
        <p>Thus, E′ is the set of
distinct users, each with its most significant position. The level of
a user is used to compare significance. The clusters
are traversed only once. Initially the bucket set E is empty.
When the algorithm reads the first user in the first cluster,
the bucket is populated. After that, all the users
in all the clusters are read one by one. Each user U<sub>i</sub> in the clusters is
compared to U<sub>e</sub> of the bucket set E by level (position
of appearance from the root). The nearest user is the user
which is closer to the root. Lines 6 to 9 of Algorithm 2 show
the comparison. The user U<sub>i</sub> with the smaller value, i.e. nearest
to the root, replaces user U<sub>e</sub> in the bucket. The algorithm
continues until all the clusters are read.</p>
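        <p>The bucket construction can be sketched in Python as below. The sketch is illustrative only: the function name most_significant_bucket and the representation of each cluster as a map from user to level of appearance are assumptions, and a dictionary lookup replaces the scan over the bucket.</p>
        <preformat>
```python
def most_significant_bucket(clusters):
    """Map every distinct user to its smallest level over all clusters."""
    bucket = {}
    for cluster in clusters:                  # each cluster is read once
        for user, level in cluster.items():
            # A new user is added; a known user is replaced only when this
            # appearance is nearer to the root than the one stored so far.
            if user not in bucket or bucket[user] > level:
                bucket[user] = level
    return bucket


# Clusters of the worked example with levels of appearance (root = level 0).
C1 = {1: 0, 11: 1, 8: 1, 2: 2, 3: 2, 6: 2, 12: 2}
C2 = {2: 0, 8: 1, 9: 1, 6: 2, 12: 2, 10: 2}
C10 = {10: 0, 5: 1, 2: 2}
bucket = most_significant_bucket([C1, C2, C10])
print(bucket)
```
        </preformat>
        <p>In the example, user 2 appears at level 2 in two clusters but is stored at level 0, because it is the root of its own cluster.</p>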
        <sec id="sec-5-3-1">
          <title>Duplicate Removal With Condition</title>
          <p>The output of Algorithm 2 is the input for the third
algorithm. Each user U<sub>i</sub> in the cluster set is compared to the
user U<sub>e</sub> of the bucket set E′. All occurrences of a user that are less
significant and have level ≠ 0 are deleted from the clusters. If a user
is present in both the cluster and the bucket set with the same
level, then the user is kept in at least one cluster. All other
duplicates are deleted.</p>
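          <p>A possible reading of these rules is sketched below in Python. It is a hedged illustration, not the authors' code: the function name remove_with_condition and the user-to-level map per cluster are assumptions, the bucket is passed in as a precomputed map from user to most significant level, and the bucket level is reset to 1 after an equal-level match exactly as the rule is stated in the problem formulation. The result depends on the order in which clusters are read.</p>
          <preformat>
```python
def remove_with_condition(clusters, bucket):
    """Delete less significant copies of users; roots (level 0) are never deleted."""
    levels = dict(bucket)  # working copy of the bucket levels
    result = []
    for cluster in clusters:
        kept = {}
        for user, level in cluster.items():
            if level > levels[user] and level != 0:
                continue              # less significant copy: delete it
            if level == levels[user]:
                levels[user] = 1      # reset value taken from the stated rule
            kept[user] = level
        result.append(kept)
    return result


# Example clusters with levels, and the bucket of most significant levels.
C1 = {1: 0, 11: 1, 8: 1, 2: 2, 3: 2, 6: 2, 12: 2}
C2 = {2: 0, 8: 1, 9: 1, 6: 2, 12: 2, 10: 2}
C10 = {10: 0, 5: 1, 2: 2}
BUCKET = {1: 0, 11: 1, 8: 1, 2: 0, 3: 2, 6: 2, 12: 2, 9: 1, 10: 0, 5: 1}
reduced = remove_with_condition([C1, C2, C10], BUCKET)
print(reduced)
```
          </preformat>
          <p>Under this reading, user 8 remains in both the first and the second cluster at level 1, which is consistent with the statement that the modified algorithm does not guarantee removal of every duplicate.</p>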
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>
        For the evaluation presented in this section, the Geo-tagged
Microblog data set (http://www.ark.cs.cmu.edu/GeoTwitter), available from the ARK data repository
held at the University of Washington, was used. The data set
holds 377616 tweets covering all US states and the
District of Columbia
        <xref ref-type="bibr" rid="ref39">(Eisenstein et al. 2010)</xref>
        . The data set
features 9477 users. From this data set four cluster sets were
generated using a range of values for l, the maximum
distance from the root: {6, 7, 8, 9}. Users who did not receive
any retweet or reply message from any other user or from
themselves within that time period are not considered for clustering.
      </p>
      <sec id="sec-6-1">
        <title>Algorithm 3 Duplicate Removal</title>
        <preformat>INPUT: A cluster set C, generated using max level l and pruned using the threshold, and the bucket set E′
OUTPUT: The set C′ with most duplicates removed
for each user Ui in cluster Ci do
  for each user Ue in Bucket do
    if Ui == Ue and Ui⟨level⟩ &gt; Ue⟨level⟩ then
      Delete Ui from cluster
      if Ui == Ue and Ui⟨level⟩ == Ue⟨level⟩ then
        Set Ue⟨level⟩ = 1
      end if
    end if
  end for
end for
Return updated C′</preformat>
        <p>This produced cluster sets of 7123 clusters each
(|C| = 7123). Since there are 9477 users in total
and 7123 clusters are generated, the remaining
2354 are single users with no retweet or reply messages from
anyone, including themselves. The 7123 clusters also contain single-user
clusters, such as users who have retweeted themselves; there
are 2262 such single-user clusters. Further, the
7123 clusters contain 7576 distinct users in total. Thus,
1901 users neither received any retweet or reply message nor
sent any to other users in the same time period.
As l increases, the average number of members per cluster
in the four cluster sets also increases;
consequently the clusters become more diverse but feature greater
numbers of duplicates.</p>
        <p>To analyze the operation of the proposed algorithm we
generated cluster sets with different values of l ∈ {6, 7, 8, 9}.
The total number of clusters generated is 7123 and the total
number of distinct users for all levels is 7576. The results
are presented in Table 1, Table 2, Table 3 and Table 4. In
the tables the “Num. of Clusters” column indicates the
number of clusters retained after application of the threshold
value. Here the threshold is set to 100% so that clusters with a
minimum size of one user are included. The “Num. of Distinct Users”
column gives the number of distinct users in the retained
cluster set; this is also the size of the bucket set generated. The following
two columns give the number of duplicates before and after
the proposed duplicate removal process was applied, and the
last column compares the run time of the naive algorithm with that of the
proposed and modified algorithms. From the tables it can be
seen that in all cases the proposed algorithm eliminates all
the duplicates featured in each of the cluster sets. To
highlight the advantages that can be gained using the proposed
approach, its operation was compared with a naive approach
in which we compare every cluster in the cluster set C with
every other cluster in C and remove all duplicates. Moreover,
the modified algorithm retains certain duplicates, and its
run time is also compared. The number of duplicates retained by the
modified algorithm is shown in the tables. Figure 2 shows
the comparison of the run time of the proposed and modified
algorithms with that of the naive algorithm.</p>
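        <p>For comparison, the naive pairwise approach can be sketched as follows. This is only an illustration of the idea of comparing every cluster with every other cluster: the function name naive_remove_duplicates is hypothetical, and the choice to keep a duplicate in the earlier of the two clusters is an assumption.</p>
        <preformat>
```python
def naive_remove_duplicates(clusters):
    """Compare every pair of clusters and drop shared users from the later one."""
    result = [list(cluster) for cluster in clusters]
    for i in range(len(result)):
        for j in range(i + 1, len(result)):
            # Membership is tested against a plain list, so each pair of
            # clusters costs a full element-by-element comparison.
            result[j] = [user for user in result[j] if user not in result[i]]
    return result


C1 = [1, 2, 3, 6, 8, 11, 12]
C2 = [2, 6, 8, 9, 10, 12]
C10 = [2, 5, 10]
print(naive_remove_duplicates([C1, C2, C10]))
```
        </preformat>
        <p>The number of cluster pairs grows quadratically with the number of clusters, whereas the bucket-based approach reads each cluster once.</p>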
        <p>To extend the evaluation, the threshold is used for the
different levels {6, 7, 8, 9}. To explore how the threshold
affects the process, we set it to a range of
values, expressed as the percentage of top clusters to be
retained in the cluster set: {0.25%, 0.5%, 0.75%, 1.0%, 2.0%}.
Table 5 shows the results for different threshold values at level 9 only.
Figure 3 shows the run time of the naive, proposed and modified
algorithms for different threshold values. At level 9 there are
7123 clusters in total. The threshold is set such that we get a certain
percentage of top clusters. Thus the column “Number of
Clusters” gives the number of top clusters, each with a size greater
than the number given in the column “Minimum Size
of Clusters”.</p>
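        <p>Selecting a percentage of top clusters can be sketched as below. The function name select_top_clusters, the rounding up of the cluster count, and the reporting of the smallest retained size are illustrative assumptions; the study does not state how fractional counts are handled.</p>
        <preformat>
```python
import math


def select_top_clusters(clusters, percent):
    """Return the largest `percent` per cent of clusters and the smallest kept size.

    Assumes `clusters` is non-empty; rounding up is an illustrative choice.
    """
    ranked = sorted(clusters, key=len, reverse=True)  # largest clusters first
    count = max(1, math.ceil(len(ranked) * percent / 100))
    top = ranked[:count]
    return top, len(top[-1])


# Eight toy clusters with the given sizes; keep the top 25 per cent.
sizes = [1, 3, 2, 4, 1, 1, 1, 1]
example = [list(range(n)) for n in sizes]
top, min_size = select_top_clusters(example, percent=25)
print(len(top), min_size)
```
        </preformat>
        <p>The retained clusters can then be passed to either duplicate removal algorithm in place of the full cluster set.</p>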
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Analysis and Observation</title>
      <p>The challenge of duplicate removal in large cluster
configurations that feature a significant amount of overlap, as in the
case of user communities extracted from social media
networks (such as Twitter), is the resource required to remove
all duplicates. It becomes more complicated still if
selected duplicates are to be retained because of their properties.
In the proposed algorithm a cluster set is read only once and
all the duplicates are removed. The proposed algorithm does
not take into account the level at which a user or member appears.
Since a cluster is generated for each user starting from the
root, it is better to keep the root of the cluster. Moreover, our
intuition is that certain duplicates will enrich the clusters.
To accomplish this, we modified the proposed algorithm to
keep selected duplicates. A user closer to the root user is
more similar to it. The modified algorithm reads the
cluster set twice. In the first pass, the algorithm reads the cluster set to
generate a set of users by level; in the second pass, all the
clusters are read and compared to this set of users using the
selection conditions. This set consists of the most significant distinct
users. Members that fail the conditions are deleted. From
Table 1, Table 2, Table 3 and Table 4 it is observed that as the
level l increases the number of duplicates also increases, but
the percentage of duplicates retained decreases sharply, as
shown in the column “Percentage of Duplicates Retained”. For
level 6 the percentage of duplicates retained is 5.22%, whereas
for level 9 the figure is 1.93%.</p>
      <p>Table fragment (column “Num. Duplicates Before Elimination”, three rows): 165185, 165185, 165185.</p>
      <p>In Table 5, the top 10% of the clusters account for 33% of the
total users in the cluster set, which is around 7576. Moreover,
after duplicate removal the majority of the remaining clusters are of size less
than 50. For example, at level 6, out of 7123 clusters only
15 clusters are of size more than 50, and of these
only 4 clusters exceed 100
in size. Table 6 lists the corresponding figures for the other levels. Thus,
the root users in the big clusters are those who received retweets
and replies from the largest number of other users, either directly or
through a chain of other users. We can see these users as the
most prominent users in the cluster set. In general, an
influencer spreads a message to the maximum number of members. But in our
case, the root users are those who received the most retweets
or reply messages. Thus "crisp" clusters are generated for a
cluster set where overlapping is minimized.</p>
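      <p>The size statistics discussed above (the share of users held by the largest clusters, and the number of clusters above a size threshold) could be computed along the following lines. This is a rough sketch: the helper names and the sample sizes are hypothetical and are not the data behind Table 5 or Table 6.</p>
      <preformat>
```python
import math


def top_share(sizes, fraction=0.10):
    """Share of all members held by the largest `fraction` of clusters."""
    ordered = sorted(sizes, reverse=True)
    k = max(1, math.ceil(fraction * len(ordered)))  # at least one cluster
    return sum(ordered[:k]) / sum(ordered)


def count_larger_than(sizes, threshold):
    """Number of clusters whose size exceeds the threshold."""
    return sum(1 for s in sizes if s > threshold)


# Made-up sizes for ten clusters (illustrative only, not the paper's data).
sizes = [120, 60, 30, 10, 8, 6, 3, 1, 1, 1]
share = top_share(sizes)                   # largest 10% of clusters
over_50 = count_larger_than(sizes, 50)     # clusters with more than 50 members
over_100 = count_larger_than(sizes, 100)   # clusters with more than 100 members
```
      </preformat>
      <p>Sorting the sizes once keeps the computation simple; for a cluster set of the size reported here, a single sort is inexpensive.</p>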
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>Here, we demonstrated methods to eliminate duplicates
among numerous overlapping clusters formed from retweet
and reply message links among users in Twitter data.
The retweet and reply networks among users are highly
overlapping, and overlapping clusters blur the dissimilarity
between clusters. To obtain well-separated (dissimilar) clusters,
elimination of duplicates is necessary. Our
methods run much faster than the naive algorithm. Moreover,
using the modified method we can selectively delete duplicates much
faster. This study also shows the generation of "crisp"
clusters and, among them, a few prominent clusters, that is,
clusters which are much bigger than the other clusters in the set
after duplicate elimination. The retweet/reply network
clusters represent active users in the cluster configuration if we
do not consider the clusters with only one member.</p>
      <p>This study opens up a few directions for future work.
The bucket set can be converted into a knowledge set of
individual distinct users, which can learn by reading
different clusters. Thus, the properties of the users will be enhanced,
and so will the clusters. The use of can be used for much
bigger data set for close approximation. Another direction
is to investigate how physical distance matters
between users, i.e., users who are retweeting the posts of another
user. Furthermore, a more detailed study of the most prominent
clusters in the cluster set will open up new avenues.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgement</title>
      <p>This research work is partially supported by the
Visvesvaraya PhD scheme, Ministry of Electronics and
Information Technology, Government of India.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Adedoyin-Olowe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gaber</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Stahl</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A survey of data mining techniques for social media analysis</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>Journal of Data Mining &amp; Digital Humanities</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sachdeva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schoenebeck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>Finding overlapping communities in social networks: Toward a rigorous approach</article-title>
          .
          <source>In Proceedings of the 13th ACM Conference on Electronic Commerce</source>
          , EC '
          <volume>12</volume>
          ,
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Barbier</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Data mining in social media</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>In Social network data analytics</source>
          . Springer.
          <fpage>327</fpage>
          -
          <lpage>352</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Bild</surname>
            ,
            <given-names>D. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dick</surname>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>Z. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Aggregate characterization of user behavior in twitter and analysis of the retweet graph</article-title>
          .
          <source>ACM Trans.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Internet Technol.</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>4:1</fpage>
          -
          <lpage>4:24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Brzozowski</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Who should i follow? recommending people in directed social networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2010.
          <article-title>Measuring user influence in twitter: The million follower fallacy</article-title>
          .
          <source>In 4th International AAAI Conference on Weblogs and Social Media (ICWSM).</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Chandra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Muhaya</surname>
            ,
            <given-names>F. B.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Estimating twitter user location using social interactions-a content based approach</article-title>
          .
          <source>In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing</source>
          ,
          <fpage>838</fpage>
          -
          <lpage>843</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Caverlee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>You are where you tweet: A content-based approach to geo-locating twitter users</article-title>
          .
          <source>In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, number 10 in CIKM '10</source>
          ,
          <fpage>759</fpage>
          -
          <lpage>768</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Clauset</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Finding community structure in very large networks</article-title>
          .
          <source>Physical review E</source>
          <volume>70</volume>
          (
          <issue>6</issue>
          ):
          <fpage>066111</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Conover</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ratkiewicz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Francisco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goncalves</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Political polarization on twitter</article-title>
          .
          <source>AAAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Dreier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kuinke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Przybylski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reidl</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rossmanith</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sikdar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Overlapping communities in social networks</article-title>
          .
          <source>CoRR</source>
          abs/1412.4973.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Street</surname>
            ,
            <given-names>W. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Community detection in graphs through correlation</article-title>
          .
          <source>In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '
          <volume>14</volume>
          ,
          <fpage>1376</fpage>
          -
          <lpage>1385</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          2010.
          <article-title>A latent variable model for geographic lexical variation</article-title>
          .
          <source>In Proceedings of the 2010 Conference on EMNLP</source>
          ,
          <fpage>1277</fpage>
          -
          <lpage>1287</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Gloor</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Krauss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fischbach</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schoder</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Web science 2.0: Identifying trends through semantic social network analysis</article-title>
          .
          <source>In 2009 International Conference on Computational Science and Engineering</source>
          , volume
          <volume>4</volume>
          ,
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kelley</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Magdon-Ismail</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mertsalov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Finding overlapping communities in social networks</article-title>
          .
          <source>In 2010 IEEE Second International Conference on Social Computing</source>
          ,
          <fpage>104</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Gregory</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>A fast algorithm to find overlapping communities in networks</article-title>
          .
          <source>In Machine learning and knowledge discovery in databases</source>
          . Springer.
          <fpage>408</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Hecht</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Tweets from justin bieber's heart: The dynamics of the location field in user profiles</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</source>
          , CHI '
          <volume>11</volume>
          ,
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          2015.
          <article-title>Non-exhaustive, overlapping clustering via low-rank semidefinite programming</article-title>
          .
          <source>In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '
          <volume>15</volume>
          ,
          <fpage>427</fpage>
          -
          <lpage>436</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Neville</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Data mining in social networks</article-title>
          .
          <source>In Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers</source>
          ,
          <fpage>287</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Kiss</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bichler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Identification of influencers - measuring influence in customer networks</article-title>
          .
          <source>Decis. Support Syst</source>
          .
          <volume>46</volume>
          (
          <issue>1</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Kloumann</surname>
            ,
            <given-names>I. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Community membership identification from small seed sets</article-title>
          .
          <source>In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '
          <volume>14</volume>
          ,
          <fpage>1366</fpage>
          -
          <lpage>1375</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Kouloumpis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Twitter sentiment analysis: The good the bad and the omg!</article-title>
          <source>In ICWSM</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Kryvasheyeu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Obradovich</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Moro</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Van Hentenryck</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fowler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cebrian</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Rapid assessment of disaster damage using social media activity</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>Science Advances</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Moon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>What is twitter, a social network or a news media</article-title>
          ?
          <source>In Proceedings of the 19th International Conference on World Wide Web, WWW '10</source>
          ,
          <fpage>591</fpage>
          -
          <lpage>600</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Lancichinetti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fortunato</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities</article-title>
          .
          <source>Phys. Rev.</source>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>E</source>
          <volume>80</volume>
          (
          <issue>1</issue>
          ):
          <fpage>016118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reid</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McDaid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hurley</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Detecting highly overlapping community structure by greedy clique expansion</article-title>
          .
          <source>ArXiv</source>
          e-prints.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Using text mining and sentiment analysis for online forums hotspot detection and forecast</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>Decis. Support Syst.</source>
          <volume>48</volume>
          (
          <issue>2</issue>
          ):
          <fpage>354</fpage>
          -
          <lpage>368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Lussier</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Network effects on tweeting</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Discovery Science</source>
          , DS'
          <volume>11</volume>
          ,
          <fpage>209</fpage>
          -
          <lpage>220</lpage>
          . Berlin, Heidelberg: Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stanton</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tarjan</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <article-title>Clustering social networks</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Algorithms and Models for the Web-graph, WAW'07</source>
          ,
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          . Berlin, Heidelberg: Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Naaman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Boase</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>C.-H.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Is it really about me?: Message content in social awareness streams</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work</source>
          , CSCW '
          <volume>10</volume>
          ,
          <fpage>189</fpage>
          -
          <lpage>192</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Paul</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dutta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Coenen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Cluster of tweet users based on optimal set</article-title>
          .
          <source>In 2016 IEEE Region 10 Conference (TENCON)</source>
          ,
          <fpage>286</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Shiokawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fujiwara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Onizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Fast algorithm for modularity-based graph clustering</article-title>
          .
          <source>In AAAI</source>
          ,
          <fpage>1170</fpage>
          -
          <lpage>1176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Data mining for social network analysis.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>In 2008 IEEE International Conference on Intelligence and Security Informatics</source>
          , xxxiii-xxxiv.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Whang</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gleich</surname>
            ,
            <given-names>D. F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I. S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Overlapping community detection using neighborhood-inflated seed expansion</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>28</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1272</fpage>
          -
          <lpage>1284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hofman</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mason</surname>
            ,
            <given-names>W. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Watts</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Who says what to whom on twitter</article-title>
          .
          <source>WWW '11</source>
          ,
          <fpage>705</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Community detection for emerging networks</article-title>
          .
          <source>In Proceedings of the 2015 SIAM International Conference on Data Mining</source>
          ,
          <fpage>127</fpage>
          -
          <lpage>135</lpage>
          . SIAM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>