Proceedings of the International Conference on Big Data, Cloud and Applications Tetuan, Morocco, May 25 - 26, 2015 Towards a new approach for community detection algorithm in social networks Sara AHAJJAM, Hassan BADIR , Mohamed EL HADDAD Laboratory of Technologies and Communication National School of Applied Sciences Tangier, Morocco Ahajjamsara@gmail.com hbadir@gmail.com elhaddad.mohamed@gmail.com Abstract— In the recent years, social networks emerged campaigns. This fact caused many researchers to look for an rapidly and it's has become more complex. Social networks play efficient method for finding top-k most influential people an important role in the dissemination of information and the through social networks. spread of influence. Several research studies are interested to the We are interested to study the problematic of detection of detection of the structure of complex networks, otherwise, to the communities and leaders’ nodes in complex network. Those community detection and leader detection. The major drawback of most of the proposed algorithms is that they require knowledge nodes have high connectivity with the others nodes, and of number of communities to detect. Our approach proposes an represent an optimization of the network while maintaining the algorithm for the detection of communities in social networks, same characteristics of the network. The major drawback of especially the detection of leader nodes (influencer’s nodes) most of the proposed approaches is that they require without a priori knowledge of the number of communities or knowledge of k leader and communities to detect. In this leaders to detect. paper, we introduce a new approach to detect leaders’ nodes and communities in the network without a prior knowledge of Keywords—Community detection; leader node; centrality; k nodes to detect. This problem has many applications such as: complex networks; graph theory. opinion propagation, studying acceptance of political movements or acceptance of technology in economics. I. INTRODUCTION Actually, identifying influential nodes in networks, also Complex networks are a powerful tool for understanding the regarded as ranking important nodes has become one of the mechanisms of various systems. They are modeled by graphs three main problems in network-based information retrieval with vertices denote the actors of phenomenon and links and mining [2]. In biological systems, we might like to denote the interactions between vertices. These graphs identify the nodes that are keys to communities and protect represents different systems such as collaboration networks, them or disrupt them, such as in the case of lung cancer [2]. In citation network, protein interaction networks, WWW and epidemic spreading, we would like to find the important nodes social networks,.., etc. These networks are complex graphs to understand the dynamic processes, which could yield an with high local density and low overall density, they play a efficient method to immunize modular networks [2]. Such fundamental role in the diffusion of information, ideas and strategies would greatly benefit from a quantitative innovation, this advantage has been the subject of various characterization of the node importance to community parts that have moved towards these networks to achieve structure. For example, suppose that we need to advertise a advertising goals (ads on Facebook), educational (LinkedIn), product in a country or we need to propagate news. For this or political (Election of USA on Twitter). The key property of purpose, we need to choose some people as a starting point a real network is its community structure. The communities and maximize the news or the products influence in the target are groups of nodes, with more links connecting to nodes of society. The problem was introduced in [3] for the first time. the same group and comparatively fewer links connecting to After that in [4] the authors formalized the problem as follows: nodes of different groups. Recent studies have verified that the given a weighted graph in which nodes are people and edge way in which such nodes are organized plays a fundamental weights represent influence of the people on each other, it is role in spreading processes [1]. Study the influence of role desired to find k starting nodes that their activation leads to models can help us to better understand why some trends or maximum propagation In particular, we will focus our innovations are adopted more quickly than others and how we attention in one topological feature: centrality. Since those can help advertisers and marketers to design more effective central nodes can diffuse their influence to the whole network 23 faster than the rest of nodes and they are the most influential spreaders. (3) II. CENTRALITY MEASURES The variety of measures of centrality comes from the fact that Where: d(i,j ;g) is the geodesic distance between i and j. the importance of a node depends on other parameters such as connectivity and orientation in the graph and the nature of - Eigenvector Centrality: measurement of the entire network [5]. The work of Linton It simulates a mechanism in which each vertex affects all of Freeman is probably one of the most important contributions its neighbors simultaneously [16]. Eigenvector centrality is a to the analysis of social networks and networking in general. sort of extended degree centrality which is proportional to There are three varieties of measures of centrality [6], we’ll the sum of the centralities of the vertex’s neighbors. A vertex cite the main ones: has large value of eigenvector centrality score either if it is - Degree Centrality: connected to many other vertices or if it is connected to It is defined as the number of links incident upon a vertex others that themselves have high eigenvector centrality [17]. which means the number of edges a vertex has. For a graph G The eigenvector centrality score of the 𝑖th vertex in the :=( V, E) with n vertices, the degree centrality Cd (i,g) for network is defined as the 𝑖th component of the eigenvector vertex is: corresponding to the greatest eigenvalue of the following characteristic equation: X= 𝜆X(4) (1) Where: di(g) is the degree of the node i. Where: 𝐴 is the adjacency matrix of the network, 𝜆 is the largest eigenvalue of 𝐴, and 𝑥 is the corresponding - Betweenness Centrality: eigenvector. Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not. For a III. RELATED WORKS graph G :=( V, E) with n vertices, the betweenness Cb (i) for The community detection algorithms have been the subject of vertex is computed as follows: several research papers. Most studies classify articles and research methods depending on the type of the algorithm. The 1. For each pair of vertices (s, t), compute all shortest paths community detection algorithms are belonging to two main between them. types of approaches namely graph partitioning and 2. For each pair of vertices (s, t), determine the fraction of classification. The major drawback of methods based on the shortest paths that pass through the vertex in question (here, partitioning of graphs is that they require a prior knowledge of vertex v). the number and size of groups to determine [9]. Also, the 3. Sum this fraction over all pairs of vertices (s, t). leader detection approaches are divided to two mains types: global and local methods. The global method deals with all the The betweenness centrality is: network topology (betweenness centrality), while the local ones treat with local position, i.e. with the node (degree centrality). Reihaneh Rabbany Khorasgani et al. suggest a  new approach to detect leaders nodes that takes into account the nodes that are not associated with no leaders. This With : is the probability that i falls on a randomly algorithm is inspired from k-means, the k nodes to be detected will be randomly selected. Other nodes will be assembled at selected geodesic connecting k and j. their closest leaders to form communities, and then find new leaders for each community around which gather followers - Closeness Centrality: until no node moves. For each community, the centrality of In graph theory closeness is a centrality measure of a vertex each member is calculated and the node with the highest within a graph. Vertices that are 'shallow' to other vertices degree is chosen as the new leader [10]. Another algorithm of (that is, those that tend to have short geodesic distances to leaders’ nodes detection in complex networks proposed by other vertices within the graph) have higher closeness. Kernighan and Lin based on partitioning of graphs. This Closeness is preferred in network analysis to mean shortest- algorithm tries to find a section of the graph minimizing the path length, as it gives higher values to more central vertices, number of edges between partitions by trading vertices and so is usually positively associated with other measures between these partitions. The results of this algorithm are such as degree [7]. generated by introducing the size of each partition [11]. The results of these two algorithms vary according to the size and The closeness centrality is: number of partitions which are introduced. Other proposed studies use classification. The classification was introduced to 24 analyze the data and partition based on a measure of similarity large-scale social networks. The main idea, that it reduces the between partitions. The problem of communities detection can scale of network by eliminating the node located in the be seen as a problem of data classification for which we need peripheral layer (namely relatively small ks value) that will not to select an appropriate distance [12]. Indeed, the classification have much spreading potency comparing with the core node in methods are generally appropriate for some networks that have general, and vice versa. This algorithm uses the k- a hierarchical structure. The result obtained by these methods decomposition centrality to deal only with the nodes in the depends on choice of similarity measure that used initially. core of the network. Hence, it reduce the scale of the network Blondel et al. have proposed the Louvain method that put each by ignoring the nodes whose ks value is small and the links node in a vertex. Other approaches are based on partitioned connected them and retain the nodes in the core layers. At last, classification which is like the partitioning of the graph the global methods (i.e. betweenness centrality and closeness requires prior knowledge of size and number of communities centrality) are used to rank the most influential spreaders [15]. to detect. Another study focuses on the spectral classification. A novel approach to detect communities and important nodes In the Leader-Follower algorithm, we define some internal of the detected communities using the spectrum of the graph. structure of a community. A community should be a clique It defines the importance nodes to community as the relative and is formed of a leader and at least one "loyal follower" changes in the c largest eigenvalues of the network adjacency which is a node in the community without neighbors in any matrix upon their removal. It has two types of nodes, the core other community. The leader is a node whose distance is less nodes who are the central nodes and the most important for the than at least one of its neighbors. The nodes will be allocated community, and the bridges node who connect the to the community in which a majority of its neighbors belong communities to each other’s. The main drawback of this by destroying the links arbitrarily. However, parasites approach, it is that to have a better result, they need to know communities i.e. leaders without loyal follower assigned will the number of partitions in the network and it cannot identify be removed from the network. This can cause a loss of the important nodes in the small communities when the information [13]. Yunlong Zhang et al propose a greedy communities are in very different size has the same size. It algorithm based on user preferences (GAUP) to operate the cannot identify the important nodes in the small communities top-k influential users, based on the model Extended when the communities are in very different size [17]. Independent Cascade (EIC said that an active node v is active Community and leader nodes detection approaches are in t-1, has only one chance to activate all inactive neighbors). diverse. Each proposed algorithm brings a new idea or During each cycle i, the algorithm adds a record in the selected improvement of existing algorithms. We will propose a new set such that the vertex S with the current set S maximizes approach to detect communities and leader nodes in complex propagation of the influence. This means that the vertex networks without a priori knowledge of number of selected in round i is the one that maximizes the incremental communities to detect. propagation influence in this cycle. This algorithm calculates the user's preferences for different subjects, and combines IV. PROPOSED ALGORITHM traditional greedy algorithms and preferences calculated by Identifying social influence in networks is critical to LSI user and calculates an approximate solution of the understanding how behaviors spread. In order to detect the problem of maximizing the influence of a specific topic. This catalyst of this influence, we need to detect the central nodes algorithm provides a good result if k exceeds a certain that are responsible for the dissemination of influence. Analysis threshold k≥ 15 and it is of complexity O(n3) [14]. More on social network datasets reveals that in each community, recently, in [14], the authors derive an upper bound for the there is usually some member (or leader) who plays a key role spread function under the LT model. They propose an efficient in that community. In fact, centrality is an important concept UBLF algorithm by incorporating the bound into CELF. [13] within social network analysis, which measures the Recent research found that the location of the node in the relative importance of a vertex within the graph. Different from others methods, our approach detect leaders, and build network topology is another important factor when estimating communities around these leaders without a priori knowledge the spreading ability. According to that, [15] propose a new of k leader to detect. approach to identify the location of node through the k-shell decomposition method, by which the network is divided into Given an input dataset, the dataset is modeled as an several layers. Each node corresponding one layer and the undirected and unweighted graph 𝐺 = (𝑉, 𝐸). 𝑉 is the vertex entire network formed the core-periphery structure. K-shell set. Each vertex in 𝑉 represents an element in the dataset. |(𝐺)| decomposition method indicates that the inner the layer is, the represents the number of vertices in 𝐺 (or elements in the more important the node. However, in practical applications dataset). 𝐸 is the edge set. Each edge represents a relationship there are often too many nodes having the same index value by between a pair of elements. Our approach has three steps as in employing these two methods to distinguish which node is “Fig. 1”: more powerful. Generally speaking, DC and k-shell  Nodes centrality: For each node v in the network G, decomposition are suitable to measure the spreading ability of calculate the eigenvector centrality. Eigenvector nodes quickly but not very accurate. Another proposed centrality or Gould’s index of accessibility [17] is a algorithm use both global and local methods of centrality measure that describes how well connected an measures to effectively identifying the influential spreaders in individual isbased on direct and indirect relationships 25 (i.e., it takes into account the connections of the nodes that represent bottlenose dolphins living in Doubtful individuals the focal individual is connected to [18]. Sound, New Zealand, and 159 edges that represent Because eigenvector centrality is proportional to an associations between dolphin pairs observed to co-occur individual’s neighbors’ centralities[19], more influential more often than expected occasionally. individuals will be more connected with other influential individuals. Lastly, embeddedness quantifies Fig. 2 and Fig. 3 shows the communities structure in the how isolatable an individual is or how involved in the network for Zachary karate club and Dolphins social network network structure an individual is [20]. If all of an respectively. We compared our community detection algorithm individual’s connections with other individuals are using leader nodes with other community detection algorithm: severed, the individual would be isolated. Thus, higher Label Propagation Algorithm (LPA) [22] and Leading embeddedness values mean that it is more difficult to isolate an individual [21]. Eigenvalue Algorithm (LEA) [23] using different metrics. For each network we calculate the quality of partition using the Ax = 𝜆x (1) modularity Q. With: A is the adjacency matrix of the network and 𝜆 is the eigenvalue. (2)  Nodes ranking: we rank the nodes by the high where the first term, is the proportion of edges inside centrality score, and choose the leader Vl which is the the communities, and the second term represents the node with the highest centrality. expected value of the same quantity in a random network  Form community: we calculate neighborhood function constructed by keeping the same node set and node degree to find the neighbors of the leader node which is the distribution, but connecting the edges between nodes node with the highest centrality score. We assign randomly. neighbors to the detected leader node to form a Also to evaluate our algorithm, we use the Adjusted Rand community. Index, the measure penalizes false negatives and false  We remove the community i.e. the leader node and its positives. Let a,b,c and d denote the number of pairs of nodes neighbors from the network and we deal with the that are respectively in the same community in both G and R, second node with the highest centrality until all the in the same community in G but in different communities in R, vertices (nodes) will be treated. in different communities in G but in the same community in R, and in different communities in both G and R. Then the ARI is V. RESULTS & EVALUATIONS computed by the following formula: To test our community detection using leader node algorithm, we ran the proposed algorithm on two networks described (3) above: Zachary’s karate club network. This is a well-known benchmark network for testing community detection And we use the Normalized Mutual Information (NMI): algorithms. The network is made up of 34 nodes and 78 edges, where every node represents a member of a karate club at an American university. If two members are  (4) observed to have social interactions within or away from the karate club, they are connected by an edge. Later, where I(X,Y) The mutual information corresponds to the because of a dispute arising between the club’s quantity of information shared by the variables. Its lower administrator and instructor, the club is eventually split bound is, representing the independence of the variables (they into two factions centered on the administrator and the share no information). The upper bound corresponds to a instructor, respectively. complete redundancy; however this value is not fixed. The table below presents the result of our algorithm and the TABLE I: DATASETS PROPERTIES Label Propagation Algorithm and Leading Eigenvector Algorithm using the cited metrics. Datasets Nodes Edges Real Communities The results in table 2 show that for Zachary Karaté Club dataset Zachary Karaté 34 78 2 our algorithm provides the best result for ARI and NMI Club comparing to LPA and LEA algorithms, while for the Dolphins Social 62 159 2 modularity that present the quality of founded clusters is quite networks good compared to LEA which provide the highest one.  Lusseau’s bottlenose dolphin social network: This is ALSO A FAMOUS NETWORK WIDELY USED AS A BENCHMARK TO validate community detection algorithms. It contains 62 26 TABLE II: COMPARISON RESULTS OF ALGORITHMS. Network Algorithm Communities Modularity NMI ARI Zachary LPA 2 0.132 0.002 -0.027 Karaté club LEA 4 0.393 0.006 -0.037 Proposed algorithm 3 0.318 0.216 0.255 Dolphins social LPA 4 0.519 0.555 0.445 network LEA 5 0.491 0.539 0.344 Proposed algorithm 16 0.345 0.047 -0.025 Figure 2.Community structure in Zachary Karaté Club provided from our algorithm where the leaders are represented by square, by LPA algorithm and LEA algorithm respectively. Figure 3.Community structure in Dolphins Social Networks provided from our algorithm where the leaders are represented by square, by LPA algorithm and LEA algorithm respectively. the research in this area is the fact that the dissemination of VI. CONCLUSION information i.e. the distribution of influence in complex networks is an element both strategic and particularly sensitive This paper presents a study of different detection algorithms to their use. Thus, we have proposed a new approach for communities and especially the leader nodes in complex detecting communities using leaders’ nodes who unlike the networks have become increasingly important given the proposed algorithms do not require a priori knowledge of k scientific and industrial challenges it represents. The idea is to nodes to detect leaders. group objects based on certain criteria. The interest shown by 27 REFERENCES [1] G. F. de Arruda, A. L. Barbieri, P. M. Rodríguez, F. A. Rodrigues, Y. Moreno, and L. da F. Costa, “Role of centrality for the identification of influential spreaders in complex networks,” Phys. Rev. E, vol. 90, no. 3, p. 032812, Sep. 2014. [2] Y. Wang, Z. Di, and Y. Fan, “Identifying and Characterizing Nodes Important to Community Structure Using the Spectrum of the Graph,” PLoS ONE, vol. 6, no. 11, p. e27418, Nov. 2011. [3] P. Domingos and M. Richardson, “Mining the Network Value of Customers,” in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2001, pp. 57–66. [4] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the Spread of Influence Through a Social Network,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2003, pp. 137–146. [5] H. Shen, X. Cheng, K. Cai, and M.-B. Hu, “Detect overlapping and hierarchical community structure in networks,” Physica A: Statistical Mechanics and its Applications, vol. 388, no. 8, pp. 1706–1712, Apr. 2009. [6] B. Renoust, “Analysis and Visualisation of Edge Entanglement in Multiplex Networks,” University of Massachusetts Lowell, 2014. [7] H. R. Gor and M. V. Dhamecha, “A Survey on Community Detection in Weighted Social Network,” International Journal, vol. 2, no. 1, 2014. [8] Q. Wu, X. Qi, E. Fuller, and C.-Q. Zhang, “Follow the Leader: A Centrality Guided Clustering and Its Application to Social Network Analysis,” The Scientific World Journal, vol. 2013, p. e368568, Oct. 2013. [9] P. Pons, Detection communities in real networks. Paris 7, 2010. (P. Pons, Détection de communautés dans les grands graphes de terrain. Paris 7, 2010.) [10] R. R. Khorasgani, J. Chen, and O. R. Zaïane, “Top leaders community detection approach in information networks,” in in Proceedings of the 4th Workshop on Social Network Mining and Analysis, 2010. ISSN : 2319-7323, 2013, p. 228. [11] B. W. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,” Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, Feb. 1970. [12] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, Feb. 2011. [13] D. Shah and T. Zaman, “Community Detection in Networks: The Leader-Follower Algorithm,” arXiv:1011.0774 [physics, stat], Nov. 2010. [14] J. Zhou, Y. Zhang, and J. Cheng, “Preference-based mining of top- influential nodes in social networks,” Future Generation Computer Systems, vol. 31, pp. 40–47, Feb. 2014. [15] Y. Xia, X. Ren, Z. Peng, J. Zhang, and L. She, “Effectively identifying the influential spreaders in large-scale social networks,” Multimed Tools Appl, pp. 1–13, Sep. 2014. [16] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi, “Measuring User Influence in Twitter: The Million Follower Fallacy.,” ICWSM, vol. 10, pp. 10–17, 2010. [17] Y. Wang, Z. Di, and Y. Fan, “Detecting Important Nodes to Community Structure Using the Spectrum of the Graph,” arXiv:1101.1703 [physics], Jan. 2011. 28