Proceedings of the International Conference on Big Data, Cloud and Applications
Tetuan, Morocco, May 25 - 26, 2015


        Towards a new approach for community detection
                  algorithm in social networks

                                      Sara AHAJJAM, Hassan BADIR , Mohamed EL HADDAD
                                               Laboratory of Technologies and Communication
                                                    National School of Applied Sciences
                                                              Tangier, Morocco
                                                          Ahajjamsara@gmail.com
                                                             hbadir@gmail.com
                                                      elhaddad.mohamed@gmail.com


        Abstract— In the recent years, social networks emerged            campaigns. This fact caused many researchers to look for an
    rapidly and it's has become more complex. Social networks play        efficient method for finding top-k most influential people
    an important role in the dissemination of information and the         through social networks.
    spread of influence. Several research studies are interested to the   We are interested to study the problematic of detection of
    detection of the structure of complex networks, otherwise, to the
                                                                          communities and leaders’ nodes in complex network. Those
    community detection and leader detection. The major drawback
    of most of the proposed algorithms is that they require knowledge     nodes have high connectivity with the others nodes, and
    of number of communities to detect. Our approach proposes an          represent an optimization of the network while maintaining the
    algorithm for the detection of communities in social networks,        same characteristics of the network. The major drawback of
    especially the detection of leader nodes (influencer’s nodes)         most of the proposed approaches is that they require
    without a priori knowledge of the number of communities or            knowledge of k leader and communities to detect. In this
    leaders to detect.                                                    paper, we introduce a new approach to detect leaders’ nodes
                                                                          and communities in the network without a prior knowledge of
       Keywords—Community detection; leader node; centrality;             k nodes to detect. This problem has many applications such as:
    complex networks; graph theory.
                                                                          opinion propagation, studying acceptance of political
                                                                          movements or acceptance of technology in economics.
                           I. INTRODUCTION                                Actually, identifying influential nodes in networks, also
    Complex networks are a powerful tool for understanding the            regarded as ranking important nodes has become one of the
    mechanisms of various systems. They are modeled by graphs             three main problems in network-based information retrieval
    with vertices denote the actors of phenomenon and links               and mining [2]. In biological systems, we might like to
    denote the interactions between vertices. These graphs                identify the nodes that are keys to communities and protect
    represents different systems such as collaboration networks,          them or disrupt them, such as in the case of lung cancer [2]. In
    citation network, protein interaction networks, WWW and               epidemic spreading, we would like to find the important nodes
    social networks,.., etc. These networks are complex graphs            to understand the dynamic processes, which could yield an
    with high local density and low overall density, they play a          efficient method to immunize modular networks [2]. Such
    fundamental role in the diffusion of information, ideas and           strategies would greatly benefit from a quantitative
    innovation, this advantage has been the subject of various            characterization of the node importance to community
    parts that have moved towards these networks to achieve               structure. For example, suppose that we need to advertise a
    advertising goals (ads on Facebook), educational (LinkedIn),          product in a country or we need to propagate news. For this
    or political (Election of USA on Twitter). The key property of        purpose, we need to choose some people as a starting point
    a real network is its community structure. The communities            and maximize the news or the products influence in the target
    are groups of nodes, with more links connecting to nodes of           society. The problem was introduced in [3] for the first time.
    the same group and comparatively fewer links connecting to            After that in [4] the authors formalized the problem as follows:
    nodes of different groups. Recent studies have verified that the      given a weighted graph in which nodes are people and edge
    way in which such nodes are organized plays a fundamental             weights represent influence of the people on each other, it is
    role in spreading processes [1]. Study the influence of role          desired to find k starting nodes that their activation leads to
    models can help us to better understand why some trends or            maximum propagation In particular, we will focus our
    innovations are adopted more quickly than others and how we           attention in one topological feature: centrality. Since those
    can help advertisers and marketers to design more effective           central nodes can diffuse their influence to the whole network


                                                                                                                                      23
faster than the rest of nodes and they are the most influential
spreaders.
                                                                                                                                     (3)
                 II. CENTRALITY MEASURES
The variety of measures of centrality comes from the fact that         Where: d(i,j ;g) is the geodesic distance between i and j.
the importance of a node depends on other parameters such as
connectivity and orientation in the graph and the nature of            -         Eigenvector Centrality:
measurement of the entire network [5]. The work of Linton                   It simulates a mechanism in which each vertex affects all of
Freeman is probably one of the most important contributions                its neighbors simultaneously [16]. Eigenvector centrality is a
to the analysis of social networks and networking in general.              sort of extended degree centrality which is proportional to
There are three varieties of measures of centrality [6], we’ll             the sum of the centralities of the vertex’s neighbors. A vertex
cite the main ones:                                                        has large value of eigenvector centrality score either if it is
-     Degree Centrality:                                                   connected to many other vertices or if it is connected to
It is defined as the number of links incident upon a vertex                others that themselves have high eigenvector centrality [17].
which means the number of edges a vertex has. For a graph G                The eigenvector centrality score of the 𝑖th vertex in the
:=( V, E) with n vertices, the degree centrality Cd (i,g) for              network is defined as the 𝑖th component of the eigenvector
vertex is:                                                                 corresponding to the greatest eigenvalue of the following
                                                                           characteristic equation:

                                                                                                   X= 𝜆X(4)
                                                                 (1)
Where: di(g) is the degree of the node i.
                                                                             Where: 𝐴 is the adjacency matrix of the network, 𝜆 is the
                                                                           largest eigenvalue of 𝐴, and 𝑥 is the corresponding
-     Betweenness Centrality:
                                                                           eigenvector.
Vertices that occur on many shortest paths between other
vertices have higher betweenness than those that do not. For a                               III. RELATED WORKS
graph G :=( V, E) with n vertices, the betweenness Cb (i) for
                                                                       The community detection algorithms have been the subject of
vertex is computed as follows:
                                                                       several research papers. Most studies classify articles and
                                                                       research methods depending on the type of the algorithm. The
1. For each pair of vertices (s, t), compute all shortest paths
                                                                       community detection algorithms are belonging to two main
between them.
                                                                       types of approaches namely graph partitioning and
2. For each pair of vertices (s, t), determine the fraction of
                                                                       classification. The major drawback of methods based on the
shortest paths that pass through the vertex in question (here,
                                                                       partitioning of graphs is that they require a prior knowledge of
vertex v).
                                                                       the number and size of groups to determine [9]. Also, the
3. Sum this fraction over all pairs of vertices (s, t).
                                                                       leader detection approaches are divided to two mains types:
                                                                       global and local methods. The global method deals with all the
The betweenness centrality is:
                                                                       network topology (betweenness centrality), while the local
                                                                       ones treat with local position, i.e. with the node (degree
                                                                       centrality). Reihaneh Rabbany Khorasgani et al. suggest a
                                                    new approach to detect leaders nodes that takes into account
                                                                       the nodes that are not associated with no leaders. This
With :         is the probability that i falls on a randomly           algorithm is inspired from k-means, the k nodes to be detected
                                                                       will be randomly selected. Other nodes will be assembled at
selected geodesic connecting k and j.
                                                                       their closest leaders to form communities, and then find new
                                                                       leaders for each community around which gather followers
-     Closeness Centrality:
                                                                       until no node moves. For each community, the centrality of
In graph theory closeness is a centrality measure of a vertex
                                                                       each member is calculated and the node with the highest
within a graph. Vertices that are 'shallow' to other vertices
                                                                       degree is chosen as the new leader [10]. Another algorithm of
(that is, those that tend to have short geodesic distances to
                                                                       leaders’ nodes detection in complex networks proposed by
other vertices within the graph) have higher closeness.
                                                                       Kernighan and Lin based on partitioning of graphs. This
Closeness is preferred in network analysis to mean shortest-
                                                                       algorithm tries to find a section of the graph minimizing the
path length, as it gives higher values to more central vertices,
                                                                       number of edges between partitions by trading vertices
and so is usually positively associated with other measures
                                                                       between these partitions. The results of this algorithm are
such as degree [7].
                                                                       generated by introducing the size of each partition [11]. The
                                                                       results of these two algorithms vary according to the size and
The closeness centrality is:
                                                                       number of partitions which are introduced. Other proposed
                                                                       studies use classification. The classification was introduced to


                                                                                                                                        24
analyze the data and partition based on a measure of similarity      large-scale social networks. The main idea, that it reduces the
between partitions. The problem of communities detection can         scale of network by eliminating the node located in the
be seen as a problem of data classification for which we need        peripheral layer (namely relatively small ks value) that will not
to select an appropriate distance [12]. Indeed, the classification   have much spreading potency comparing with the core node in
methods are generally appropriate for some networks that have        general, and vice versa. This algorithm uses the k-
a hierarchical structure. The result obtained by these methods       decomposition centrality to deal only with the nodes in the
depends on choice of similarity measure that used initially.         core of the network. Hence, it reduce the scale of the network
Blondel et al. have proposed the Louvain method that put each        by ignoring the nodes whose ks value is small and the links
node in a vertex. Other approaches are based on partitioned          connected them and retain the nodes in the core layers. At last,
classification which is like the partitioning of the graph           the global methods (i.e. betweenness centrality and closeness
requires prior knowledge of size and number of communities           centrality) are used to rank the most influential spreaders [15].
to detect. Another study focuses on the spectral classification.     A novel approach to detect communities and important nodes
In the Leader-Follower algorithm, we define some internal            of the detected communities using the spectrum of the graph.
structure of a community. A community should be a clique             It defines the importance nodes to community as the relative
and is formed of a leader and at least one "loyal follower"          changes in the c largest eigenvalues of the network adjacency
which is a node in the community without neighbors in any            matrix upon their removal. It has two types of nodes, the core
other community. The leader is a node whose distance is less         nodes who are the central nodes and the most important for the
than at least one of its neighbors. The nodes will be allocated      community, and the bridges node who connect the
to the community in which a majority of its neighbors belong         communities to each other’s. The main drawback of this
by destroying the links arbitrarily. However, parasites              approach, it is that to have a better result, they need to know
communities i.e. leaders without loyal follower assigned will        the number of partitions in the network and it cannot identify
be removed from the network. This can cause a loss of                the important nodes in the small communities when the
information [13]. Yunlong Zhang et al propose a greedy               communities are in very different size has the same size. It
algorithm based on user preferences (GAUP) to operate the            cannot identify the important nodes in the small communities
top-k influential users, based on the model Extended                 when the communities are in very different size [17].
Independent Cascade (EIC said that an active node v is active        Community and leader nodes detection approaches are
in t-1, has only one chance to activate all inactive neighbors).     diverse. Each proposed algorithm brings a new idea or
During each cycle i, the algorithm adds a record in the selected     improvement of existing algorithms. We will propose a new
set such that the vertex S with the current set S maximizes          approach to detect communities and leader nodes in complex
propagation of the influence. This means that the vertex             networks without a priori knowledge of number of
selected in round i is the one that maximizes the incremental        communities to detect.
propagation influence in this cycle. This algorithm calculates
the user's preferences for different subjects, and combines                           IV. PROPOSED ALGORITHM
traditional greedy algorithms and preferences calculated by          Identifying social influence in networks is critical to
LSI user and calculates an approximate solution of the               understanding how behaviors spread. In order to detect the
problem of maximizing the influence of a specific topic. This        catalyst of this influence, we need to detect the central nodes
algorithm provides a good result if k exceeds a certain              that are responsible for the dissemination of influence. Analysis
threshold k≥ 15 and it is of complexity O(n3) [14]. More             on social network datasets reveals that in each community,
recently, in [14], the authors derive an upper bound for the         there is usually some member (or leader) who plays a key role
spread function under the LT model. They propose an efficient        in that community. In fact, centrality is an important concept
UBLF algorithm by incorporating the bound into CELF.                 [13] within social network analysis, which measures the
Recent research found that the location of the node in the           relative importance of a vertex within the graph. Different from
                                                                     others methods, our approach detect leaders, and build
network topology is another important factor when estimating
                                                                     communities around these leaders without a priori knowledge
the spreading ability. According to that, [15] propose a new
                                                                     of k leader to detect.
approach to identify the location of node through the k-shell
decomposition method, by which the network is divided into             Given an input dataset, the dataset is modeled as an
several layers. Each node corresponding one layer and the            undirected and unweighted graph 𝐺 = (𝑉, 𝐸). 𝑉 is the vertex
entire network formed the core-periphery structure. K-shell          set. Each vertex in 𝑉 represents an element in the dataset. |(𝐺)|
decomposition method indicates that the inner the layer is, the      represents the number of vertices in 𝐺 (or elements in the
more important the node. However, in practical applications          dataset). 𝐸 is the edge set. Each edge represents a relationship
there are often too many nodes having the same index value by        between a pair of elements. Our approach has three steps as in
employing these two methods to distinguish which node is             “Fig. 1”:
more powerful. Generally speaking, DC and k-shell                        Nodes centrality: For each node v in the network G,
decomposition are suitable to measure the spreading ability of            calculate the eigenvector centrality. Eigenvector
nodes quickly but not very accurate. Another proposed                     centrality or Gould’s index of accessibility [17] is a
algorithm use both global and local methods of centrality                 measure that describes how well connected an
measures to effectively identifying the influential spreaders in          individual isbased on direct and indirect relationships


                                                                                                                                    25
       (i.e., it takes into account the connections of the                  nodes that represent bottlenose dolphins living in Doubtful
       individuals the focal individual is connected to [18].               Sound, New Zealand, and 159 edges that represent
       Because eigenvector centrality is proportional to an                 associations between dolphin pairs observed to co-occur
       individual’s neighbors’ centralities[19], more influential           more often than expected occasionally.
       individuals will be more connected with other
       influential individuals. Lastly, embeddedness quantifies         Fig. 2 and Fig. 3 shows the communities structure in the
       how isolatable an individual is or how involved in the           network for Zachary karate club and Dolphins social network
       network structure an individual is [20]. If all of an            respectively. We compared our community detection algorithm
       individual’s connections with other individuals are
                                                                        using leader nodes with other community detection algorithm:
       severed, the individual would be isolated. Thus, higher
                                                                        Label Propagation Algorithm (LPA) [22] and Leading
       embeddedness values mean that it is more difficult to
       isolate an individual [21].                                      Eigenvalue Algorithm (LEA) [23] using different metrics. For
                                                                        each network we calculate the quality of partition using the
                               Ax = 𝜆x         (1)                      modularity Q.
       With: A is the adjacency matrix of the network and 𝜆 is
       the eigenvalue.                                                                                                  (2)

     Nodes ranking: we rank the nodes by the high                      where the first term,          is the proportion of edges inside
      centrality score, and choose the leader Vl which is the
                                                                        the communities, and the second term              represents the
      node with the highest centrality.
                                                                        expected value of the same quantity in a random network
     Form community: we calculate neighborhood function                constructed by keeping the same node set and node degree
      to find the neighbors of the leader node which is the             distribution, but connecting the edges between nodes
      node with the highest centrality score. We assign                 randomly.
      neighbors to the detected leader node to form a                   Also to evaluate our algorithm, we use the Adjusted Rand
      community.                                                        Index, the measure penalizes false negatives and false
     We remove the community i.e. the leader node and its              positives. Let a,b,c and d denote the number of pairs of nodes
      neighbors from the network and we deal with the                   that are respectively in the same community in both G and R,
      second node with the highest centrality until all the             in the same community in G but in different communities in R,
      vertices (nodes) will be treated.                                 in different communities in G but in the same community in R,
                                                                        and in different communities in both G and R. Then the ARI is
                V. RESULTS & EVALUATIONS                                computed by the following formula:
To test our community detection using leader node algorithm,
we ran the proposed algorithm on two networks described                                                                       (3)
above:
    Zachary’s karate club network. This is a well-known
    benchmark network for testing community detection                   And we use the Normalized Mutual Information (NMI):
    algorithms. The network is made up of 34 nodes and 78
    edges, where every node represents a member of a karate
    club at an American university. If two members are                                                                  (4)
    observed to have social interactions within or away from
    the karate club, they are connected by an edge. Later,              where I(X,Y) The mutual information corresponds to the
    because of a dispute arising between the club’s                     quantity of information shared by the variables. Its lower
    administrator and instructor, the club is eventually split          bound is, representing the independence of the variables (they
    into two factions centered on the administrator and the             share no information). The upper bound corresponds to a
    instructor, respectively.                                           complete redundancy; however this value is not fixed.
                                                                        The table below presents the result of our algorithm and the
                    TABLE I:        DATASETS PROPERTIES                 Label Propagation Algorithm and Leading Eigenvector
                                                                        Algorithm using the cited metrics.
         Datasets         Nodes      Edges           Real Communities   The results in table 2 show that for Zachary Karaté Club dataset
     Zachary Karaté            34        78              2              our algorithm provides the best result for ARI and NMI
          Club                                                          comparing to LPA and LEA algorithms, while for the
    Dolphins Social            62        159             2              modularity that present the quality of founded clusters is quite
      networks                                                          good compared to LEA which provide the highest one.

   Lusseau’s bottlenose dolphin social network: This is
    ALSO A FAMOUS NETWORK WIDELY USED AS A BENCHMARK TO
    validate community detection algorithms. It contains 62


                                                                                                                                       26
                                                        TABLE II:         COMPARISON RESULTS OF ALGORITHMS.

                        Network                 Algorithm             Communities           Modularity          NMI           ARI


                       Zachary                   LPA                          2                0.132            0.002        -0.027
                      Karaté club                LEA                          4                0.393            0.006        -0.037
                                          Proposed algorithm                  3                0.318            0.216        0.255
                    Dolphins social                LPA                        4                0.519            0.555        0.445
                       network                     LEA                        5                0.491            0.539        0.344
                                          Proposed algorithm                 16                0.345            0.047        -0.025


  Figure 2.Community structure in Zachary Karaté Club provided from our algorithm where the leaders are represented by square, by LPA algorithm and LEA
                                                                  algorithm respectively.


Figure 3.Community structure in Dolphins Social Networks provided from our algorithm where the leaders are represented by square, by LPA algorithm and LEA
                                                                 algorithm respectively.

                                                                                  the research in this area is the fact that the dissemination of
                         VI. CONCLUSION                                           information i.e. the distribution of influence in complex
                                                                                  networks is an element both strategic and particularly sensitive
This paper presents a study of different detection algorithms                     to their use. Thus, we have proposed a new approach for
communities and especially the leader nodes in complex                            detecting communities using leaders’ nodes who unlike the
networks have become increasingly important given the                             proposed algorithms do not require a priori knowledge of k
scientific and industrial challenges it represents. The idea is to                nodes to detect leaders.
group objects based on certain criteria. The interest shown by


                                                                                                                                                             27
                              REFERENCES

[1]   G. F. de Arruda, A. L. Barbieri, P. M. Rodríguez, F. A. Rodrigues, Y.
     Moreno, and L. da F. Costa, “Role of centrality for the identification of
     influential spreaders in complex networks,” Phys. Rev. E, vol. 90, no. 3,
     p. 032812, Sep. 2014.
[2] Y. Wang, Z. Di, and Y. Fan, “Identifying and Characterizing Nodes
     Important to Community Structure Using the Spectrum of the Graph,”
     PLoS ONE, vol. 6, no. 11, p. e27418, Nov. 2011.
[3] P. Domingos and M. Richardson, “Mining the Network Value of
     Customers,” in Proceedings of the Seventh ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining, New York, NY,
     USA, 2001, pp. 57–66.
[4] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the Spread of
     Influence Through a Social Network,” in Proceedings of the Ninth ACM
     SIGKDD International Conference on Knowledge Discovery and Data
     Mining, New York, NY, USA, 2003, pp. 137–146.
[5] H. Shen, X. Cheng, K. Cai, and M.-B. Hu, “Detect overlapping and
     hierarchical community structure in networks,” Physica A: Statistical
     Mechanics and its Applications, vol. 388, no. 8, pp. 1706–1712, Apr.
     2009.
[6] B. Renoust, “Analysis and Visualisation of Edge Entanglement in
     Multiplex Networks,” University of Massachusetts Lowell, 2014.
[7] H. R. Gor and M. V. Dhamecha, “A Survey on Community Detection in
     Weighted Social Network,” International Journal, vol. 2, no. 1, 2014.
[8] Q. Wu, X. Qi, E. Fuller, and C.-Q. Zhang, “Follow the Leader: A
     Centrality Guided Clustering and Its Application to Social Network
     Analysis,” The Scientific World Journal, vol. 2013, p. e368568, Oct.
     2013.
[9] P. Pons, Detection communities in real networks. Paris 7, 2010. (P. Pons,
     Détection de communautés dans les grands graphes de terrain. Paris 7,
     2010.)
[10] R. R. Khorasgani, J. Chen, and O. R. Zaïane, “Top leaders community
     detection approach in information networks,” in in Proceedings of the
     4th Workshop on Social Network Mining and Analysis, 2010. ISSN :
     2319-7323, 2013, p. 228.
[11] B. W. Kernighan and S. Lin, “An Efficient Heuristic Procedure for
     Partitioning Graphs,” Bell System Technical Journal, vol. 49, no. 2, pp.
     291–307, Feb. 1970.
[12] S. Fortunato, “Community detection in graphs,” Physics Reports, vol.
     486, no. 3–5, pp. 75–174, Feb. 2011.
[13] D. Shah and T. Zaman, “Community Detection in Networks: The
     Leader-Follower Algorithm,” arXiv:1011.0774 [physics, stat], Nov.
     2010.
[14] J. Zhou, Y. Zhang, and J. Cheng, “Preference-based mining of top-
     influential nodes in social networks,” Future Generation Computer
     Systems, vol. 31, pp. 40–47, Feb. 2014.
[15] Y. Xia, X. Ren, Z. Peng, J. Zhang, and L. She, “Effectively identifying
     the influential spreaders in large-scale social networks,” Multimed Tools
     Appl, pp. 1–13, Sep. 2014.
[16] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi, “Measuring
     User Influence in Twitter: The Million Follower Fallacy.,” ICWSM, vol.
     10, pp. 10–17, 2010.
[17] Y. Wang, Z. Di, and Y. Fan, “Detecting Important Nodes to Community
     Structure Using the Spectrum of the Graph,” arXiv:1101.1703 [physics],
     Jan. 2011.


                                                                                 28