<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding Communities in Site Web-Graphs and Citation Graphs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Antonis Sidiropoulos</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Engineering Lab, Department of Informatics, Aristotle University</institution>
          ,
          <addr-line>Thessaloniki, 54124</addr-line>
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Web is a typical example of a social network. One of the most intriguing features of the Web is its self-organization behavior, which is usually faced through the existence of communities. The discovery of the communities in a Web-graph can be used to improve the e ectiveness of search engines, for purposes of prefetching, bibliographic citation ranking, spam detection, creation of road-maps and site graphs, etc. Correspondingly, a citation graph is also a social network which consists of communities. The identi cation of communities in citation graphs can enhance the bibliography search as well as the data-mining. In this paper we will present a fast algorithm which can identify the communities over a given unweighted/undirected graph. This graph may represent a Web-graph or a citation graph.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>During the past decade the World Wide Web became the most popular network
in the World. WWW grows with a very fast speed, thus the information that can
be found through it is huge. Two are the main problems for the Web. How to nd
information and how to get it fast. For the former, several solutions have been
presented over last years. From the ordering by the keyword frequency we have
moved to the Link Analysis Ranking Algorithms (LAR). LAR Algorithms gave
an admissible solution for the problem of searching. For the latter, the proxy
servers and the Content Distribution Networks gave a breath to the problem of
speed. However, the Web is still growing fast, the web-sites are huge and usually
semi-automatically generated. So the above areas of research need to enhance
their propositions.</p>
      <p>
        On the other hand, the Web is a characteristic example of a social network
Socials networks are usually abstracted as graphs, comprised by vertices, edges
(directed or not) and in some cases, with weights on these edges. Social networks
have been studied long before the conception of the Web. Pioneering works for
the characterization of the Web as a social network and for the study of its basic
properties are due to the work of Barabasi and its colleagues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Later, several
studies investigated other aspects, like its scale-free nature [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], its growth [
        <xref ref-type="bibr" rid="ref22 ref3">22, 3</xref>
        ],
etc.
      </p>
      <p>
        One of the most intriguing features of the Web, and of other social networks
as well, is its self-organization behavior, which is usually faced through the
existence of communities. Groups of vertices that have high density of edges within
them and lower density of edges between groups is a frequent informal de nition
of a community. The notion of a community is very useful from a practical
perspective, because it can be used to improve the e ectiveness of search engines
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], for purposes of prefetching [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], bibliographic citation ranking [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], spam
detection, creation of road-maps and site graphs, etc.
      </p>
      <p>In addition, the discovery of communities in citation graphs will also help to
facilitate and enhance the bibliography search. For example, it could be possible
to nd relevant documents even if there are no common keywords and no direct
citation between them. Also it will be possible to nd authors with the same
interests as well as communities of authors working on the same scienti c domain.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The notion of a Web community is not very strict; it is generally described as
a substructure (subset of vertices) of a graph with dense linkage between the
members of the community and sparse density outside the community. The
existence of communities in the Web was rst reported in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The aforementioned
qualitative de nition though is not adequate when trying to devise algorithms
for the determination of communities in Web graphs. Thus, we need sharper,
quantitative de nitions for the communities.
      </p>
      <p>
        In order to provide such a quantitative de nition, we need to introduce some
\quantities" The basic quantity to consider is di, the degree of a generic node i of
the considered undirected graph G (representing the examined network), which,
in terms of its adjacency matrix A[i; j], is di = Pj A[i; j]. If we consider a
subgraph V G, that node i belongs to it, we can split the total degree d in two
quantities: di(V ) = diin(V ) + diout(V ). The rst term of the summation denotes
the number of edges that connect node i to other nodes which belong to V , i.e.,
din = Pj2V A[i; j]. The second term of the summation formula denotes the
numi
ber of connections toward nodes in the rest of the graph, i.e., diout = Pj2=V A[i; j].
The rst de nition of communities is due to Flake [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], who de ned a
community as the set of nodes C(C G) such that diin(C) &gt; diout(C)8i 2 C.
      </p>
      <p>
        In general, we may give many di erent quantitative de nitions of a
community, which depend on the context of the application where it is developed. The
structure of a community can be encountered at various scales in the Web. The
most thoroughly investigated are the inter-site communities, which span several
Web sites, and usually de ne broad thematic areas determined by a set of
keywords, e.g., the 9/11 community [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The notions of compound documents [
        <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
        ]
and logical information units [
        <xref ref-type="bibr" rid="ref17 ref21">21, 17</xref>
        ] are closely related to the Web communities,
but at a much smaller scale, being comprised by a handful of Web objects in a
single site and thus they are intra-site communities.
      </p>
      <p>We extend the notion of intra-site communities and propose communities
whose topic is much more generic than the topic of logical documents and their</p>
      <p>Pajek
(b) noc.auth.gr
existence is determined by the density of the linkage among the pages that they
are comprised of.</p>
      <p>To support this claim, we examined several Web sites with a crawl
available on the Web. As an intuitive step, we con rmed the existence of such
communities using graph visualization1. As a sample, we present the drawing of
the http://www.hollins.edu Web site (Figure 1(a), whose January 2004 webbot
crawl was available on the Web. We can easily see the co-existence of compound
documents (at the lower right corner), with compact node clusters (at the
upper center), and less apparent clusters (at the upper right of the image). Also,
the resulting image of the http://noc.auth.gr/ (as of Feb 2006) is illustrated at
Figure 1(b).</p>
      <p>In an analogous way, di erent type of communities do exist in an author
citation (or collaboration) graph. An author may have worked in two institutions
(working groups) so he should belong to two communities de ned by the
working groups. One working group may collaborate with another, so both working
groups belong to a higher level group (hierarchical clustering). At the same time,
only one person of a working group may collaborate with another one, so this
person should belong to the cluster de ned by his working group and in a higher
level to the cluster de ned by the second working group plus himself. So, in
citation graphs di erent types of communities do exist at the same time.</p>
      <p>
        In order to cover most of the above community cases, we give a weaker
de nition for communities than Flake et al. did. We de ne a community C(C
G) such that [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]:
dout(C)
din(C)
&lt; s
(1)
where din(C) is the number of links within the community and dout(C) is the
number of links from members to non-members. We set the factor s to 1, so we
have a basic 'community' but we can also set s to any number less than 1 in
order to nd stronger communities.
1 The visualization of all these networks was performed with the visualization package
Pajek [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        The identi cation of communities is essentially a graph clustering procedure,
that its goal is to identify mutually disjoint subsets of vertices, the communities.
The discovery of optimal Web communities as well as any graph clustering is
an NP-hard problem. Thus all the methods proposed rely on some properties of
the graphs. Some methods evaluate only the local neighborhood of a vertex in
order to decide whether it belongs to a speci c community, whereas some other
methods demand examination of the whole Web graph in order to discover such
communities. No matter what method is selected to identify the communities,
there always exists a trade-o associated with this task. This trade-o relates
the `quality' of the discovered communities, i.e., the density of linkage inside
them, with the computational (time and/or resources) cost. In the rest of this
section, we outline the most important methods for community discovery, namely
bibliometric, spectral, maximum- ow and graph-theoretic techniques. The rst
family of methods exploits only local information, the second family is based on
information from the whole graph, whereas the other two families can be tailored
to use either local or global information or a combination of them.
Bibliometric Methods The bibliographic methods attempt to identify
communities by searching for similarity between pairs of vertices. Thus, they have
to answer the question `Are these two pages similar'. To answer this question
they need to de ne a similarity metric for the vertices. There are two two such
metrics that are widely used. The rst is the co-citation coupling and the
second is the bibliographic coupling. Bibliographic techniques are relatively old and
well-established techniques for the discovery of communities. More information
on their application can be found in [
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ].
      </p>
      <p>
        Spectral Methods The most popular spectral methods for community
identi cation is HITS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The HITS algorithm takes a subset of the Web graph
based on a keyword match. Then it extents this set by adding nodes that are 2
links away from any node already in the subset. If A is the adjacency matrix of
the subgraph, the matrix products AT A and AAT are symmetric and de nitely
positive. Each of them will have the property of being identical the left and the
right eigenvectors (because of symmetry) and that the rst eigenvector will have
all positive components (with positive eigenvalue). These subsequent
eigenvectors can be used to distinguish pages into di erent communities in a manner
related to the spectral graph partitioning. using a method such that, it was
found that the non-maximal eigenvectors can be used to split pages from a base
set into multiple communities that contain similar text but they are dissimilar
in meaning.
      </p>
      <p>
        Maximum-Flow Methods Flake et al. in a series of papers used the concept
of max- ow/min-cut in order to discover communities in Web graphs. The
algorithm proposed [
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ] works as following. Its input is a graph G, a set of 'seed'
Web pages S, and a single (user-de ned) parameter . The procedure creates a
new graph, G , that has one arti cial vertex t. The sink vertex, t, is connected to
all original vertices with a small capacity speci ed by . After constructing G ,
the procedure calls min-cut for randomly selected source vertex s to t and uses
the resulting residual graph to return the portion of R that remains connected
to s. This connected component is guaranteed to be a community.
Graph-Theoretic Methods We saw earlier that the bibliometric methods try
to identify the \strongest" edges in order to insert the adjacent vertices into a
community. Girvan and Newman ([
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) took the opposite approach, following a
graph-theoretic way. Instead of trying to construct a measure that tells us which
edges are the most central to the communities, instead they focused on those
edges that are least central, the edges that are most \between" the
communities. Rather than constructing communities by adding the strongest edges to an
initially empty vertex set, they construct them by progressively removing edges
from the original graph. They exploited the vertex betweenness, which had been
studied in the past as a measure of the centrality and in uence of nodes in
networks. The betweenness centrality of a vertex i is de ned as the number of the
shortest paths between pairs of other vertices that run through i. They proposed
a simple algorithm and its steps are the following: 1. Calculate the betweenness
for all edges in the network. 2. Remove the edge with the highest betweenness. 3.
Recalculate betweenness for all edges that are a ected by the removal. 4. Repeat
from step 2 until no edges remain. With this process the graph is gradually being
disconnected revealing any communities.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Motivation &amp; Contribution</title>
      <p>The above described techniques have some strong as well as some weak points.
Starting with the bibliographic methods we can say that they can only be applied
to a citation graph and not to a Web-graph because they usually need speci c
information about vertex relationship, i.e. co-authors.</p>
      <p>The Spectral methods can only be used when keyword information is available
over the graph. Also, they are not capable of nding the communities of the graph
but only a community related to a keyword search. This means that it cannot
be applied when \keyword" info is not available or we do not a-priori know the
\keyword" related communities.</p>
      <p>The maximum- ow methods have some major disadvantages. The rst
disadvantage is that they are based on a very \strict" de nition for communities. If
our graph has not such strict communities the method is unable to nd any. The
second one is that they are mainly used in order to nd inter-site communities
by removing the intra-site links. The existence of intra-site links practically
invalidates the results. On the other hand the commutation time is large, and it is
based on several decisions that must be made during the implementation of the
algorithm. So a lot of variations exist. As the variations get better performance,
the computation time increases dramatically. The most important disadvantage
of these methods, is the existence of the factor which must be set manually
and there is no rule for setting it relatively to the graph characteristics. The only
method is to perform a binary search for a good value of which practically
means several failed tries in order to get a clustering.</p>
      <p>Finally the Graph-theoretic methods, as we will describe later, require high
memory usage as well as long computation time.</p>
      <p>In addition to the above, in the real world, a web-page may not belong
strictly to one community, but to more than one. Likewise, a web page may
not belong to any community. Thus the set of communities C1; C2; :::; Ck may
Si Ci G and not always Si Ci = G. There also may exist intersections between
the communities. So it may exist i; j(i 6= j) such that Ci \ Cj 6= ;. This is our
main theoretical di erence with all the rest methods.</p>
      <p>So, our method searches for a set of clusters C = fC1; C2; :::g such that
8Ci 2 C : ddoiunt((CCii)) &lt; s, where s could be user de ned but normally is set to 1.
There may exist in nite clusterings with this property. We focus to a clustering
that minimizes the expression:</p>
      <p>QC =
1</p>
      <p>X
jCj 8Ci2C din(Ci)
dout(Ci)
(2)
4</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Method</title>
      <p>
        Our target is to nd clusters such that Equation 1 is true. We can build these
clusters by starting from some representative nodes for each one if we could nd
them. We refer to them as kernel nodes. Our rst care should be to nd some
kernel nodes. On the other hand, Betweenness Centrality is a way of showing
how central is every node of a graph G. Many algorithms have been presented in
the bibliography for the calculation of the Betweenness Centrality. The smartest
of them is [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with computational complexity O(nm) and memory requirement
of O(m + n), where n is the number of nodes and m is the number of edges. So,
we can use the nodes with low CB as kernel nodes.
4.1
      </p>
      <sec id="sec-4-1">
        <title>Clustering Using Betweenness Centrality</title>
        <p>
          The notion of CB is used in paper [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for the clustering of a graph. In this
paper, the CB is calculated for each edge of the graph. At each step, the edge
e with the highest CB is removed from the graph and the CB is recalculated
for some of the edges. In other words, all the shortest paths that contained
the erased edges e are recalculated and the CB is recomputed for all edges.
This procedure is repeated until we get groups that are not connected to each
other (connected components). The complexity of this algorithm is high in time
O(n3), since the CB is recalculated in every step of the algorithm. Moreover, it
has expensive memory requirement O(n2), since we have to store all the shortest
paths for all the node pairs. This forbids the use of this algorithm for large graphs
and specially for dense ones, because the memory requirement becomes huge.
The authors of [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] mention that using our age computers (2003) the described
method can be applied in graphs of about 10000 nodes.
        </p>
        <p>The conclusion of the above paper is that vertices with high CB are near to
the borders of the clusters as well as edges with high CB are inter-cluster edges.
On the other hand vertices and edges with low CB reside at the center of the
clusters or simply are not connected to other clusters.</p>
        <p>The above claim is not true when a part of our graph has a tree-like structure.
A tree-like part of the graph means that in this sub-graph there are no cycles
and the number of edges is equal to the number of vertices. These parts look
like graph tails. The existence of such parts in our graph in ect the previous
statement. All the nodes in such a sub-graph have high CB but they clearly
consist a cluster. We assume that in a Web-graph these parts do not cover a big
ratio of the graph. We refer to these tree-like parts of the graph as graph tails.</p>
        <p>Practically, tree-like tails in a web graph represent virtual documents which
do not have links pointing out of the document. So, they may consist of
independent clusters of our graph or they may be members of other clusters.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Clustering Method CBC</title>
        <p>The algorithm CBC (Clustering with Betweenness Centrality) begins with the
knowledge that the nodes with the lower CB are members of clusters and they
are not connected directly to other clusters. Initially we remove the graph tails
0
from graph G as described before, resulting to a new graph G . We compute the
0
centrality based on the G . The rest of the procedure is depicted in Figure 2,
where C is initially an empty set of clusters.</p>
        <p>Clique Formation The nodes with the lower CB are the cluster kernels. So,
we can build some initial small clusters around them. This is depicted as pseudo
code in Figure 3. These small clusters are the graph Cliques. Note here that our
term Cliques is di erent from the one used in the bibliography where usually
means a fully connected sub-graph. The Cliques for us are the areas around the
it kernel nodes. The clique size may vary and this is a function of how dense or
sparse the graph is. For a dense graph the cliques consist of all the nodes that
are directly connected to the kernel node. For sparse graphs the cliques may be
larger. The optimal clique size, can not be known apriori, since the graph may be
function InitClustering(graph G,G', clustering C, int max_clique_size) {</p>
        <p>InitiateCliques(G',C,max_clique_size);
ExtentTailedClusters(G,C); // Add the tails of size 1</p>
        <p>// to the cliques that they are connected
MakeTailedClusters(G,C); // Make the tree-like tails independent Cliques</p>
        <p>MergeTailedClusters(G,C); // Merge tail-clusters until reach the max-cluster-size
}
dense in some areas, but sparse in other ones. The rst time that InitiateCliques
is called, the Maximum Clique Size parameter is set to zero. Thus, the Cliques
that will be built during the rst iteration of the algorithm have a diameter of
two. Note here that if an Initial Clique has a size of only one or two nodes, it is
being erased and ignored. This means that after the rst step we may have some
orphan nodes. These will be nodes with high CB and usually located between
the clusters. The computation cost of this step is a linear function of the size of
the graph: O(n).</p>
        <p>Clique Merging Having a set of cliques the next step is to merge them in
order to construct correlated clusters. The cliques may not be correlated. The
pseudo-code is presented in Figure 4. Assuming that in the previous step we
found l number of clusters (cliques), in function Merge we build a l l matrix
B. Each element B[i; j] corresponds to the number of edges from cluster Ci to
cluster Cj . The diagonal elements B[i; i] correspond to the number of internal
edges of the clique. The matrix B is obviously symmetric. Note here that if a
node x belongs to clusters Ci and Cj , and there is an edge x ! y(y 2 Ci), then
this edge counts once for B[i; i] since it is an internal edge, but it also counts for
B[i; j], since y belongs also to Cj . So, the sum of a row of matrix B is not equal
to the total number of edges.</p>
        <p>This merging step of the algorithm consists of several iterations. In each
iteration one merge is performed. The iterations stop when there is not any
other mergable pair. The pair that will be merged is the pair with the maximum
B[i; j]=B[i; i]. The conditions for a merge may vary and they depend on the
user parameters if there are any. A parameter may be the factor s, which is
mentioned in the de nition of a community. So, in this step we check every
pair of the clusters (cliques) and we select the best one for merging. A merge
cannot be done if the two clusters are already correlated or their union is greater
than the maximum cluster size that the user may have set. A merge of clusters
function ClusterMerge(graph G,clustering C...) {
while(! ok) {</p>
        <p>MakeCliquesStronger(C,G);
Merge(G,C);
if(c-&gt;best_quality==0) {manage_subsets(c);}
delete_the_worst(C); // Delete a cluster if does not fulfil parameters
add_orphans_to_cliques(G,C);
ok=check(C);
if(!ok) InitClustering(G,C);
}
}
Ci; Cj can be done if B[i; j] &gt; Pk B[i; k]=2 or if at least one of the Ci; Cj is not
correlated or if B[i; j]=B[i; i] &gt;= s and we cannot nd any better pair.</p>
        <p>If the initial number of cliques is l, then the maximum number of iterations
that will be executed is l. Each iteration checks every pair of cliques, so the time
The value of l depends on t1h)e2 +gra::p: hwchhicahramcteearinsstictsh,abtuttheit ciosmapfluenxcittyionis oOf (pl3n).,
that is needed is l2 + (l
where n is the number of nodes in our graph2. So the time complexity, with
respect to the number of nodes is O(npn) in the worst case. The memory space
requirements is a matrix l l which means O(l2), that is near to O(n).</p>
        <p>In Figure 4 there are various steps in order to improve quality and/or speed
depending on our needs. For example, the procedure named MakeCliquesStronger
can be used for moving nodes between clusters. Its default behavior is to check
for all the nodes (in O(n) time) the number of the links that they have to each
cluster. So, a node may change cluster if there exists another cluster to which the
node has more links. The function ManageSubsets is used to remove any clusters
that are subsets of other clusters. It will be called only for speed optimization.
Finally we may add orphan nodes to clusters, even if the resulting clustering is
not better than the current one. This will lead the algorithm to minimize the
number of orphan nodes but the resulting quality factor of Equation 2 QC will
be worst.</p>
        <p>Finally the clustering is checked if it ful ls our constraints. In not, we
reinitialize into cliques the nodes that remained as orphans. Each time that the
InitClustering is called, we use a new value for the factor max clique size. In
our implementation the sequence of values is: pn; pn=2; 2pn; pn=3; 3pn; :::. In
most cases that we faced during experimentation, the clustering is computed in
two repetitions of function InitClustering and in a few cases in four. Of course
if the cluster sizes vary, we expect that more repetitions will be needed.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Method Evaluation</title>
      <p>
        As mentioned in the previous section, an analogous idea is the one described
in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The major disadvantage of that algorithm is the huge complexity, since
every step of the algorithm it requires recomputation of the centrality. Although
2 Given that we have initially set the parameter max clique size to pn.
the computation is incremental, the time and space complexity is very high. In
this work we will not compare against this algorithm, since the di erence in the
complexity and the memory requirements is obvious.
      </p>
      <p>
        On the other hand, a comparison with Flake's algorithm could not be done
for three reasons. The rst one is that we have di erent de nitions of what is
a cluster, so Flake's methods gives di erent types of clusters. The second one
is that Flake searches for hierarchical clustering. In our case we do not search
for hierachies in the clusters. We search for a set of clusters that ful ls our
contraints, no matter in which hierarchy depths these clusters reside. Finally, as
it is stated in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] their method is applied for nding intra-site communities (by
removing the inter-site links). In our case we search for inter-site communities.
5.1
      </p>
      <sec id="sec-5-1">
        <title>Evaluation Dataset and Method</title>
        <p>The evaluation Dataset consists of real and synthetic Web-graphs. The real
Dataset includes the web sites of: noc.auth.gr (as of Feb 2006), www.hollings.edu
(as of Jan 2004) and www.unicef.org (as of Jan 2006).</p>
        <p>
          The synthetic Web-graphs were generated with the FWgen tool [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The
parameters that must be given to FWgen are ve: (a) number of nodes, (b)
number of edges or density related to the respective fully connected graph, (c)
number of clusters to generate, (d) skew, which re ects the relative sizes of
the generated clusters, and nally (e) the assortativity factor, which gives the
percentage of the edges that will be intra-community edges. The values that are
meaningful for assortativity are greater than 50%. The higher the assortativity is,
the stronger clusters are produced. If the assortativity is 100% then the generated
clusters will be disconnected to each other.
        </p>
        <p>The generator creates two les. The rst one is the graph and the second
one records the produced clusters so that we can compare the clustering of
the CBC to the \generated optimal". Since the generation of the edges follows
random decisions, the \generated optimal" clustering may not be identical to
the \absolutely optimal". With the term \absolutely optimal" we mean the
clustering that minimizes the factor QC. This will be shown later in this section.</p>
        <p>The method is evaluated for two dimensions: quality and speed. The
evaluation for quality could be done only with the synthetic graphs, for which we
know apriori the clusters that are constructed. So, we count the distance of our
method from the optimal clustering by using a distance metric explained in
Appendix A. We also count the value QC of Equation 2. The smaller this value
is, the better clustering is produced. On the other hand the evaluation of the
clustering speed is trivial. We count the real-time of the algorithm execution
(CPU time occupied by the process). For the real dataset we present statistical
results of the clustering.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Experiments</title>
        <p>In Figures 5(a,b) we present the speed performance of our algorithm. The
presented \CPU Time" is reported by the unix kernel by using the system call
200
400 600
Number of nodes
(a)
800</p>
        <p>1000
Fig. 5. Graphs attributes: Nodes: n, Edges: 15 n, clusters: 5 (n &lt; 1000), 10 (1000
n &lt; 10000), 100 (10000 n), assortativity: 0.85, skew:0.1
times() and represents the time that the process remained in CPU. The line
with diagonal crosses represents the time needed to compute the centrality. The
line with cross points represents the time needed by our algorithm to compute
the clustering. Finally the line with star points represents the summation of all
the previous. As we can see, our algorithm needs much less time than the
centrality betweenness, which is proved to have O(mn) complexity. In Figures 5(c,d),
we verify that the CB computation is linear to n m, so our time measurements
are correct. Thus, if we use a centrality approximation algorithm, we will be able
to cluster really huge graphs consisting of a lot more than 200000 nodes.</p>
        <p>In Figure 6 we present the results of the clustering that are related to the
assortativity parameter. Figure 6(a) shows the distance of our clustering from the
\generated optimal". The distance is computed by using the method presented
in Appendix A. The line with the cross points stands for our CBC algorithm,
while the line with the diagonal cross points stands for our algorithm that uses
the option minimize orphan nodes set. As we can see, the minimize orphan
nodes version gives a clustering closer to the \generated optimal". This happens
because in the \generated optimal" clustering there are no orphan nodes. The
distance from the \optimal" is 1% in the worst case, and it converges to zero
as the clusters become stronger. On the other hand, in Figure 6(b) we present
the quality of the clustering. It is expressed with the factor QC that is de ned
in Equation 2. It is obvious that when the clusters are strong the quality of the
clustering is better. Hereby we must note that both our CBC versions keep the
quality very close to the \generated optimal" clustering and they are always
1
better than it. This is due to the fact that the generator does not produces
optimal clusters, but if they do exist in the graph, our algorithm is able to nd
them. This explains the fact that in Figure 6(a) the distance from the \generated
optimal" clustering is not zero.</p>
        <p>In Figure 7, we keep the graph characteristics stable and we change the
number of edges. As we can see the time that is needed for the clustering remains
small. Finally, in Figure 8, in x-axes the number of clusters is varied. It is shown
that when the clusters are few, the required time is higher from the one that is
needed than more clusters. This is due to the fact that more merge operations
must be performed.</p>
        <p>In Table 1, we present summarization of the results for the real Web-graphs.
The rst 3 columns de ne the graph. Columns MC and MS stand for the user
parameters maximum cluster size and minimum cluster size. Next column
(Clusters) contains the number of clusters that have been found during each run. QC
is the resulting Quality (Equation 2) of the clustering. Column \Or" denotes
the number of the orphan nodes that remained in the graph. Finally, the Drate
column shows the percentage of the nodes that belong to more than one cluster.
It is computed as:</p>
        <p>Drate =</p>
        <p>P8i2G N (i)</p>
        <p>NC
where N (i) is the number of clusters that node i belongs to, and NC is the
number of nodes that belong to at least one cluster. A value of 1 for Drate
means that all nodes belong to exactly one cluster.</p>
        <p>Site
noc.auth
noc.auth
hollins
unicef
unicef</p>
        <p>Since the possible clusters that may have dout=din &lt; s may be in nite, we
must somehow focus in some of them. For this reason, our implementation takes
two parameters. The rst one is the minimum cluster size (MS in Table 1)
and the second one is the maximum cluster size relatively to the graph size
(MC in Table 1). It is obvious that these two parameters a ect the results.
For example the rst try to cluster the site noc.auth.gr used the default value
50% as MC. This caused the algorithm to leave a lot of orphaned nodes. The
second run was executed by using MC=80%. This produced a big cluster of 496
nodes (Table 2), which is greater than the half of the graph and as we saw it
could not be splited into smaller clusters. The results of this clustering are also
visualized in Figure 1(b) where each cluster is presented with di erent color.
Unfortunately, it is impossible to visualize the nodes which belong to more than
one cluster since one node can have only one color. These nodes get the color
from a randomly selected cluster among the ones they belong. Full results for all
these experiments are available at http://delab.csd.auth.gr/~asidirop/clustering.</p>
        <p>Finally, we can conclude that the CBC requires time O(nm), which is actually
the time needed to compute the Betweenness Centrality. The memory space
requirements are O(n) for the merging procedure plus about O(n) in order to
store foreach cluster the nodes that contains. Over the synthetic graphs, the
measured clustering quality is always better than the quality of the `generated
optimal' clustering. The distance between these two clustering is at most 1% and
it depends on what we are looking for. Over the real Web-graphs it is obvious
that the quality of the clusters depends on the Web-graph itself. When we are
searching for large clusters it is obvious that the quality factor QC will be better.
The sizes of the clusters may vary depending on the application that the results
will be used. Our algorithm is able of nding clusters of any desirable size if they
do exist in the Web-graph.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions &amp; Future Work</title>
      <p>
        In this paper, we made an overview of clustering methods. We also presented
our method called CBC which requires time O(mn) and memory O(n), as well
as experiments over both synthetic and real datasets. The experiments show, as
expected, that this method is very fast and can be used in order to cluster huge
Web-graphs. As it is shown, the slow part of the method is the computation of
the betweenness centrality. As a future work, we plan to use a centrality
approximation algorithm to test the clustering speed and the quality performance. This
method is also being tested for prefetching methods over a content distribution
network[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Acknowledgments: We deeply acknowledge Ulrik Brandes help by providing
us the implementation of the Betweenness Centrality Computation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Additionally, we are really grateful to Dimitrios Katsaros for his inspiring ideas as
well as for the implementation of the graph generator FWgen [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Appendix A.</title>
    </sec>
    <sec id="sec-8">
      <title>Clustering Comparison</title>
      <p>In this section we will give a de nition for a distance function for 2 clusterings.</p>
      <p>We de ne the function N (n; c) to be 1, if node n belongs to cluster c and 0
otherwise. Also, function K(n; C) gives the set of clusters that node n belongs
to. The number of clusters that a node belongs to may be zero, one or any other
number in the range of [0::jCj] when the node belongs to more than one clusters.</p>
      <p>The similarity S of a node n1 to node n2, given a set of clusters C, is set to
be the percent of the occurrences of node n2 in K(n1; C).</p>
      <p>S(n1; n2; C) =</p>
      <p>P
8c2K(n1;C)</p>
      <p>N (n2; c)
jK(n1; C)j</p>
      <p>In the case where a node can belong only to one cluster, then the function S
will get the value of 1 if the two nodes are members of the same cluster, and 0
otherwise. In the general case that a node can belong to more than one cluster,
then when two nodes always resize in the same clusters, S will be 1. If two nodes
never resize in the same clusters, S will be 0. S will be 0:5, if the rst node
belongs to 2 clusters and the second node belongs to only one of them, etc. Note
here that it could be S(n1; n2; C) 6= S(n2; n1; C), in the case that the nodes are
able to belong to di erent number of clusters. I a node can belong to only one
cluster, then S(n1; n2; C) = S(n2; n1; C)</p>
      <p>D(CA; CB; G) =</p>
      <p>P P jS(n1; n2; CA)
8n12V 8n22V</p>
      <p>S(n1; n2; CB)j
jV j(jV j
1j)
(3)</p>
      <p>In order to be able to compare 2 clusterings, we will give the de nition of
the equation D(CA; CB; G) (3), where CA is the set of clusters created by method
A and CB is the set of clusters created by method B over the graph G = (V; E).
D is calculated as the average value of the similarity di erences in these two
clusterings for every nodes pair. D is normalized in the scale of [0::1]. I the two
clusterings CA and CB are equal, then D will be 0. The worst and only case that
D may be 1, is the following: CA has only one cluster and all the nodes of graph
G belong to that cluster. CB has as many clusters as the number of nodes and
every cluster consists of only one node.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Reka</given-names>
            <surname>Albert</surname>
          </string-name>
          and
          <article-title>Albert-Laszlo Barabasi. Statistical Mechanics of Complex Networks</article-title>
          .
          <source>Reviews of Modern Physics</source>
          ,
          <volume>74</volume>
          :
          <fpage>47</fpage>
          {
          <fpage>97</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Albert-Laszlo Barabasi</surname>
            and
            <given-names>Reka</given-names>
          </string-name>
          <string-name>
            <surname>Albert</surname>
          </string-name>
          .
          <article-title>Emergence of Scaling in Random Networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>286</volume>
          :
          <fpage>509</fpage>
          {
          <fpage>512</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ginestra</given-names>
            <surname>Bianconi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Albert-Laszlo</given-names>
            <surname>Barabasi</surname>
          </string-name>
          .
          <article-title>Bose-einstein Condensation in Complex Networks</article-title>
          .
          <source>Physical Review Letters</source>
          ,
          <volume>86</volume>
          (
          <issue>24</issue>
          ):
          <volume>5632</volume>
          {
          <fpage>5635</fpage>
          ,
          <year>June 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ulrik</given-names>
            <surname>Brandes</surname>
          </string-name>
          .
          <article-title>A Faster Algorithm for Betweenness Centrality</article-title>
          .
          <source>Journal of Mathematical Sociology</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ):
          <volume>163</volume>
          {
          <fpage>177</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Soumen</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <article-title>Mining the Web: Discovering Knowledge from Hypertext Data, chapter 7.1</article-title>
          , pages
          <fpage>205</fpage>
          {
          <fpage>206</fpage>
          . Morgan Kaufmann Publishers,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Wouter de Nooy, Andrej Mrvar, and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Batagelj</surname>
          </string-name>
          .
          <article-title>Exploratory Social Network Analysis with Pajek</article-title>
          . Cambridge University Press,
          <year>March 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Dmitriev</surname>
          </string-name>
          , Carl Lagoze, and
          <string-name>
            <given-names>Boris</given-names>
            <surname>Suchkov</surname>
          </string-name>
          .
          <article-title>As We may Perceive: Inferring Logical Documents from Hypertext</article-title>
          .
          <source>In Proceedings ACM International Conference on Hypertext and Hypermedia (HT'2005)</source>
          , pages
          <fpage>66</fpage>
          {
          <fpage>74</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Nadav</given-names>
            <surname>Eiron</surname>
          </string-name>
          and
          <string-name>
            <surname>Kevin S. McCurley. Untangling Compound</surname>
          </string-name>
          <article-title>Documents on the Web</article-title>
          .
          <source>In Proceedings ACM International Conference on Hypertext and Hypermedia (HT'2003)</source>
          , pages
          <fpage>85</fpage>
          {
          <fpage>94</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Gary</given-names>
            <surname>William</surname>
          </string-name>
          <string-name>
            <surname>Flake</surname>
          </string-name>
          , Steve Lawrence,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee Giles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frans</given-names>
            <surname>Coetzee</surname>
          </string-name>
          .
          <article-title>Selforganization of the Wweb and Identi cation of Communities</article-title>
          . IEEE Computer,
          <volume>35</volume>
          (
          <issue>3</issue>
          ):
          <volume>66</volume>
          {
          <fpage>71</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Gary William Flake, Steve Lawrence, and Clyde Lee Giles.
          <article-title>E cient Identi cation of Web Communities</article-title>
          .
          <source>In Proceedings 8th ACM International Conference on Knowledge Discovery in Data (KDD'2002)</source>
          , pages
          <fpage>150</fpage>
          {
          <fpage>160</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Gary William Flake, Robert E. Tarjan, and
          <string-name>
            <given-names>Kostas</given-names>
            <surname>Tsioutsiouklis</surname>
          </string-name>
          .
          <article-title>Graph Clustering and Minimum Cut Trees</article-title>
          .
          <source>Internet Mathematics</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <volume>385</volume>
          {
          <fpage>408</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Gary William Flake, Kostas Tsioutsiouliklis, and
          <string-name>
            <given-names>Leonid</given-names>
            <surname>Zhukov</surname>
          </string-name>
          .
          <article-title>Methods for Mining Web Communities: Bibliometric, Spectral, and Flow</article-title>
          . In Web Dynamics, pages
          <volume>45</volume>
          {
          <fpage>68</fpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. David Gibson,
          <string-name>
            <given-names>Jon M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Prabhakar</given-names>
            <surname>Raghavan</surname>
          </string-name>
          .
          <article-title>Inferring Web Communities from Link Topology</article-title>
          .
          <source>In Proceedings ACM International Conference on Hypertext and Hypermedia (HT'98)</source>
          , pages
          <fpage>225</fpage>
          {
          <fpage>234</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Michelle</given-names>
            <surname>Girvan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark E. J.</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <article-title>Community Structure in Social and Biological Networks</article-title>
          .
          <source>Proceedings National Academy of Sciences</source>
          ,
          <volume>99</volume>
          :
          <fpage>7821</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Dimitrios Katsaros.
          <article-title>FWgen: Generating Web-graphs with Communities</article-title>
          .
          <source>Technical report</source>
          , Aristotle University,
          <year>2006</year>
          .
          <article-title>manuscript under preparation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Jon</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Authoritative Sources in a Hyperlinked Environment</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <volume>604</volume>
          {
          <fpage>632</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wen-Syan Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Selcuk Candan</surname>
            , Quoc Vu, and
            <given-names>Divyakant</given-names>
          </string-name>
          <string-name>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Query Relaxation by Structure and Semantics for Retrieval of Logical Web Documents</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ):
          <volume>768</volume>
          {
          <fpage>791</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Mark</surname>
            <given-names>E.J.</given-names>
          </string-name>
          <string-name>
            <surname>Newman</surname>
            and
            <given-names>Michelle</given-names>
          </string-name>
          <string-name>
            <surname>Girvan</surname>
          </string-name>
          .
          <article-title>Finding and Evaluating Community Structure in Networks</article-title>
          . Physical
          <string-name>
            <surname>Review</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <volume>69</volume>
          :
          <fpage>026113</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Antonis</given-names>
            <surname>Sidiropoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yiannis</given-names>
            <surname>Manolopoulos</surname>
          </string-name>
          .
          <article-title>A New Perspective to Automatically Rank Scienti c Conferences Using Digital Libraries</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <volume>289</volume>
          {
          <fpage>312</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Antonis</surname>
            <given-names>Sidiropoulos</given-names>
          </string-name>
          , George A.
          <string-name>
            <surname>Pallis</surname>
            , Dimitrios Katsaros, Konstantinos Stamos, Athena Vakali, and
            <given-names>Yannis</given-names>
          </string-name>
          <string-name>
            <surname>Manolopoulos</surname>
          </string-name>
          .
          <article-title>Prefetching in Content Distribution Networks via Web Communities Identi cation and Outsourcing</article-title>
          .
          <source>Technical report</source>
          , Aristotle University,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Keishi</surname>
            <given-names>Tajima</given-names>
          </string-name>
          , Kenji Hatano, Takeshi Matsukura, Ryouichi Sano, and
          <string-name>
            <given-names>Katsumi</given-names>
            <surname>Tanaka</surname>
          </string-name>
          .
          <article-title>Discovery and Retrieval of Logical Information Units in Web</article-title>
          .
          <source>In Proceedings Workshop on Organizing Web Space (WOWS'99)</source>
          , pages
          <fpage>13</fpage>
          {
          <fpage>23</fpage>
          , Berkeley, CA,
          <year>August 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Duncan</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Watts</surname>
          </string-name>
          and
          <string-name>
            <surname>Steven H. Strogatz</surname>
          </string-name>
          . Collective Dynamics of `Small-world'
          <source>Networks. Nature</source>
          ,
          <volume>393</volume>
          :
          <fpage>440</fpage>
          {
          <fpage>442</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>