International Conference on Information and Communication Technology and Its Applications (ICTA 2016) Federal University of Technology, Minna, Nigeria November 28 – 30, 2016 Community Detection in Networks: Algorithms and Challenges Elizabeth N. Onwuka, Bala A. Salihu, and Paschal S. Iornenge Department of Telecommunication Engineering, Federal University of Technology, Minna, Nigeria Abstract—Community detection in networks is a research area connecting a pair of vertices, showing that some that is gaining a lot of attention mostly because of its great relationships are stronger than others [4]. applications in areas such a coding, link prediction, routing, In this work, we provide a brief survey on some of the containment of virus and warm online, and recently for works done in community detection, discuss a few issues and criminal detection. In this era of Big Data, it is envisaged that challenges these approaches are yet to completely take care community detection will be handy in solving many societal of. We begin by briefly discussing some of the metrics used problems. Many algorithms have been developed to solve the by researchers for community detection in section II, then we complex problem of clustering groups of nodes in a network. will discuss some of the algorithms and techniques used to In this paper, a few of these algorithms and their challenges detect communities of users within a network in section III. are discussed. Directions for future research in this area are also pointed out. We will do a comparison of these methods in section IV, and conclude by recommending considerations for future Keywords-community detection; social network graphs; research in community detection. binary networks, weighted networks I. INTRODUCTION II. METRICS USED IN COMMUNITY DETECTION The massive improvement in technology over the years, A. Centrality Metrics especially the technical and commercial success of electronic communications devices has made communication and Centrality measures are used to depict the level of interactions between people very easy. The widely growing importance or standing of a given node in relation to other attention towards research in big data has led to significant nodes in a network or community. These include; moves and potential moves towards modeling human Betweeness centrality (BC), which measures the tendency of behaviour and understanding the nature of their relationships a node to be found along the shortest path between two other through large amounts of data collected over time. A nodes. A node with high BC is important in a network network is a group of nodes (or vertices) connected through because it serves as an important route for information flow edges or links. A community in a network is a group of in that network, it means that removal of such node will nodes having more internal connections with each other than either collapse the network or weaken it considerable. external connections with the rest of the network [1]. They Closeness centrality (CC) is a measure of the sum of all the are also called Clusters, Cliques or Cohesive groups [2], [3]. shortest paths between that node and other nodes in the graph Communities can overlap, meaning, members of one i.e. it measures how close a node is to other nodes in the community or clique can also be part of another community network; Degree centrality (DC) measures the level of or clique. connectedness of a node i.e., it gives a measure of how many Community detection is commonly carried out with the nodes are directly connected to a given node in relation to all use of social network graphs. Social Network graphs may be other nodes in a network. Eigenvector centrality which directed or undirected, weighted or unweighted etc. In a shows to what extent a node is connected to other well- directed graph, the direction of communication between two connected nodes, this metric gives the intuitive reasoning nodes (vertices) is considered, that is, the edges connecting that an important node will usually be connected to other the vertices have directions associated to them. Directed important nodes. It has a google variant called Pagerank graphs are called digraphs if no multiple edges connect the [5][6]. These metrics play important roles in determining vertices and multigraphs if multiple edges connect the nodes. which nodes belong to certain communities and how Undirected graphs are made up of unordered pairs of important or influential such nodes are [7]. vertices, i.e., direction of communication is not important. Also, an undirected graph is unweighted (or binary) if a B. Other Metrics single edge connects each pair of vertices. In this case, it is Other important metrics include Clustering coefficient of only important to know if two vertices communicate or not. a node, which is the probability that any two random The extent or frequency of communication is not important. neighbours of a given node, chosen at random are, However, for weighted graphs, there can be multiple edges connected; belonging degree, conductance and modularity. 208 International Conference on Information and Communication Technology and Its Applications (ICTA 2016) 1) Belonging Degree terrorist networks, for instance, it is possible for one group of Assuming C is a community in a network; for a people not to know the others [9] e.g. as reported in [10] on node; 𝑢 ∈ 𝑉; 𝑘𝑢 , 𝑁𝑢 ,are node degrees and neighbor sets the Sept. 9/11 hijackers, those trained to fly may not know respectively. And let 𝑤𝑢𝑣 be the weight of the link between the people on ground. This makes the task of detecting and nodes u and v (where v is already in the community). analyzing such networks much trickier using the conventional means of network detection. ku   wuv (1) Many researchers have undergone the task of coming up uNu with better improved methods to detect communities and analyze behaviour on social graphs. Most of the earlier works were based on binary detection schemes, since many For the community C, and node u, the belonging degree natural networks are biological networks. 𝐵 𝑢, 𝐶 of node u to community C is defined as: w A. Binary Networks uv Many natural networks are biological networks, which B  u, C   uC (2) are usually binary. In such networks, it is only important if ku two nodes are either connected or not connected. The relative strengths of connections between nodes do not matter, a It shows to what extent a given node belongs to a typical binary network is shown in Fig 1. community [4]. Most of the earlier algorithms for community detection were based on binary networks. Very prominent among them 2) Conductance is one proposed by Girvan & Newman [11] who focused on It measures the fraction of total edge volume that point the boundaries of communities rather than their core. It is outside the cluster [4]. It measures how well knit a graph is. said to be the first algorithm in modern age community The lower the conductance valueɸ(C), the more connected detection [12]. In their approach, edges are removed from the the nodes are. network based on Betweeness centrality values. The edges with the highest Betweeness centrality are removed, cut (C , C / G )  C   (3) Betweeness is calculated again for the edges affected by this wc removal, and the process is repeated until no edges remain. Where 𝑐𝑢𝑡(𝐶, 𝐶 𝐺)represents the number of cut edges in Even though it worked considerably well even when tested the community (which represents all edges leaving the on real life networks such as the Zachary Karate club [13], it community), and 𝑤𝑐 is the total weight of edges in the could not be used to detect communities with weighted edges and the run time of the algorithm as the number of nodes community. increase makes it unsuitable for large graphs. 3) Modularity It is based on the idea that a good cluster should have a higher internal and lower external density of edges compared to a null model with similar structural properties but without a community structure k Q   (eii  ai 2 ) (4) i 1 where, 𝑒𝑖𝑖 is the fraction of weights of edges belonging to community i, while 𝑎𝑖 is the fraction of all edges connecting community iwith other communities. It is a measure of network partition and it shows the quality of the community structure in the network [8]. Figure 1. A diagram showing the structure of a binary network Cfinder was developed to uncover the structure of III. COMMUNITY DETECTION complex networks by analyzing the statistical features of The goal of community detection is to partition a network overlapping networks [14]. In their work, a community (a k- into dense regions of the graph. Each region represents a clique community) was defined as a union of all k-cliques group of nodes that are closely related, and hence are in the (complete subgraphs of size k) that can be reached from each same community. A lot has been done in recent years to other through a series of adjacent k-cliques (where adjacency detect communities of users from large social or mobile means sharing nodes). It was based on the fact that members networks. Insecurity issues all over the world have made can be reached through well connected subsets of nodes. most of the approaches to find great applications in criminal This approach allowed members of one clique to possibly network detection. While detection could be a little easy in also be found in another clique (overlapping) which made it most cases, certain networks, especially covert networks of better than the divisive and agglomerative methods which criminals, contain highly complex structures of form many communities by allowing each node to remain a communication that cannot be very easily detected. In member of only one community but cuts it off from its other 209 International Conference on Information and Communication Technology and Its Applications (ICTA 2016) communities. The community detection was done by setting connections to a particular node are not equal) so community a threshold weight for the links and ignoring links that were detection is more reliable when the actual extent of below this threshold weight making it essentially a binary interaction between nodes is considered rather than just network. The value of the threshold weight is lowered until friendship between nodes [17]. the size of the largest community is twice as much as the second largest one. Thus making sure that there are no giant 0.9 communities by merging smaller communities. When the 0.4 initial community is already unweighted, there is no need for a threshold weight so the smallest value of k for which no 0.55 giant communities appear is selected. 0.75 0.6 0.5 The RAK algorithm (Name derived from the surnames of 0.8 the researchers: Raghavan, Albert and Kumara) which is 0.28 based on label propagation was also proposed in [15]. In 0.95 their approach each node is first initialized to a unique label 0.36 which represents the community it belongs to, and these 0.33 labels then propagate through the network. A node would determine its community based on the labels of its neighbours. Each node joins a community which has the most of its neighbours as members and the labels of the nodes are updated at each iteration. As the propagation continues, dense connected groups of nodes finally settle for a unique label, and in the end, all nodes with the same labels are placed in the same community. This continues until each node in the network has the label to which the maximum Figure 2. A diagram showing the structure of a weighted network number of its neighbours belong to (or one of the labels used A notable algorithm for detection of communities in by a maximum number of its neighbours), not when the weighted networks is the COPRA algorithm [18]. This labels no longer change since it is possible to have nodes algorithm is based on the label propagation algorithm (RAK) with equal maximum number of neighbours with two or discussed earlier. However, in this case, a node can be a more labels. Like the Cfinder, this algorithm also worked member of more than one community, and this algorithm can very well when tested with datasets from the Zachary karate also handle weighted networks. The method uses a belonging club as well as the U.S. football network. The fact that it is coefficient which shows the strength of a node’s membership possible for the iteration to end with two disconnected 1 groups of nodes having the same label and requiring a to a community, which is set to a threshold value , where v 𝑣 breadth-first search on the subnetwork of each individual is a factor in the algorithm which also indicates the group to separate the disjointed communities increases the maximum number of communities the given node can computation time and complexity of the technique. belong to and the belonging coefficient of each label sum to Furthermore, in [5] a criminal network graph of 43, 000 one. Instead of having just one label, community identifiers nodes was constructed and analyzed from 1,000 publicly are used and a node is allowed to keep more than one leaked email addresses of predominantly Nigerian advanced community identifier in each label without retaining all of fee fraud scammers using the Pagerank algorithm. The data them. During each propagation, the node labels are was collected by getting the list of friends and friends of constructed and the nodes with belonging coefficients less friends with the help of some of these email addresses on than a given threshold are deleted. If it happens that all the Facebook. The graph visualization was carried out by a pairs in a node have a belonging coefficient less than the method known as Force Atlas 2 on Gephi. Pagerank threshold, the one with the maximum belonging coefficient algorithm and centrality measures such as Betweeness (or one of two or more pairs with maximum belonging Centrality, Closeness Centrality, Eigenvector Centrality and coefficient chosen at random) is retained. This algorithm had Degree Centrality were used to identify key actors in the a huge advantage of allowing overlapping community network and form communities of criminals. Link nodes structures and ability to detect communities on a weighted between communities were also determined through this network. However, just like the RAK algorithm, the COPRA process. Relative importance of communities could be algorithm does not always converge to a constant state where determined by the number of top players in each community. the node labels no longer change after each iteration, thus The method showed 5% possibility of them being scammers bringing in more complexity as well. and 15% possibility of them being members of scammer In [19] a method for finding communities of users by communities for a random sample of 100 nodes. first identifying core nodes and finding cliques around those core nodes was proposed. They argued that having global B. Weighted Networks knowledge of the graph required by most algorithms was All of the methods briefly discussed earlier focused on unrealistic for very large graphs. A unique feature of their binary networks. In such methods, attributes of nodes are approach is that it is not sensitive to the position of the emphasized instead of the edge content which represent the source node. Their method applied the Girvan & Newman actual link between the nodes. Even though more (GN) algorithm [11] to detect all communities in the challenging, edges provide a richer characterization of network, then the nodes with maximal degrees among the community behaviour [16]. Most networks are weighted (all communities are found (i.e., core nodes) and the 210 International Conference on Information and Communication Technology and Its Applications (ICTA 2016) communities are expanded from these core nodes by finding community detection. Each of these algorithms, as discussed, the most likely nodes closer to the core node using node could be useful depending on the nature of the problem degrees. being addressed, and the type or size of network in use as The Strength algorithm [8] has a strategy which is to well. However, it was observed that no algorithm so far has find an initial partial community with maximal node strength detected communities with consideration for the strengths of and expand by adding tight nodes to the partial community indirect links between nodes that may either connect directly until detection is complete for that particular community. as well or have no direct connection at all. Research is The algorithm consists of two parts: finding initial currently ongoing to explore the possible improvement this community, and expanding the community. To find the may impact on the earlier kinds of networks. initial community, the node with the highest node strength (sum of all weights of connections between the node and its TABLE I. COMMUNITY DETECTION ALGORITHMS neighbours) is chosen along with all its neighbours as an Algorithm Metric/Technique Weighted Overlapping initial community. Any node with a belonging degree less used than a threshold (chosen as 0.5) is not connected enough to Girvan & Betweeness No No the community and is thus removed from it. To expand, the Newman Centrality belonging degree for all neighbours of the community are CFinder Use of threshold No Yes weight calculated. The neighbours with belonging degrees up to the RAK Label propagation No No threshold (0.5) are automatically added to the community, COPRA Label propagation Yes Yes while those with belonging degrees less than 0.5 but 0.4 or Strength Belonging Yes Yes more are added to the community only if adding them to the Coefficient community increases the value of the modularity. The Conductance- Belonging degree Yes Yes expansion stops when no neighbour of the community meets based and conductance these criteria. It also greatly supports overlapping communities. The algorithm however, degrades in its REFERENCES performance when the overlapping increases. In [4] a conductance-based algorithm which was an [1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, Feb. 2010. enhancement of the Strength algorithm was developed. The algorithm is just like the Strength algorithm only that a new [2] S. P. Borgatti, “2-Mode concepts in social network analysis,” Encyclopedia of complexity and system science, vol. 6, 2009. objective function, Conductance, is used in addition to the [3] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the belonging degree, and also, the method of selecting an initial overlapping community structure of complex networks in nature and community is slightly different. Here the initial community society,” Nature, vol. 435, no. 7043, pp. 814–818, Jun. 2005. is a community of two nodes in the network with the highest [4] Z. Lu, Y. Wen, and G. Cao, “Community detection in weighted edge weight between the two of them. To expand, the networks: Algorithms and applications,” in Pervasive Computing and belonging degrees of all neighbours of the community are Communications (PerCom), 2013 IEEE International Conference on, calculated and the neighbour with the highest belonging 2013, pp. 179–184. degree (which has the most likelihood of belonging to the [5] H. Sarvari, E. Abozinadah, A. Mbaziira, and D. Mccoy, “Constructing and Analyzing Criminal Networks,” 2014, pp. 84–91. community) is temporarily added to the community instead of using a threshold value. If the conductance of this new [6] M. Rahman, “Application of social networking algorithms in program analysis: understanding execution frequencies,” Colorado State temporary community is less than that of the initial University. Libraries, 2007. community, the new node is permanently added to the [7] M. Ahsan, T. Singh, and M. Kumari, “Influential node detection in community and the expansion process is repeated until social network during community detection,” in Cognitive Computing adding a new node with the highest belonging degree from and Information Processing (CCIP), 2015 International Conference the neighbours of a community gives a higher conductance on, 2015, pp. 1–6. than that of the present community. This is based on the fact [8] D. Chen, M. Shang, Z. Lv, and Y. Fu, “Detecting overlapping that a stronger or tighter community has a lower conductance communities of weighted networks via a local algorithm,” Physica A: Statistical Mechanics and its Applications, vol. 389, no. 19, pp. 4177– than a weaker one. 4187, Oct. 2010. [9] S. T. Smith, K. D. Senne, S. Philips, E. K. Kao, and G. Bernstein, IV. SUMMARY “Covert network detection,” Lincoln Laboratory J, vol. 20, no. 1, pp. 47–61, 2013. Here, we compare the key algorithms mentioned in the [10] V. E. Krebs, "Mapping Networks of Terrorist Cells," Connections, previous sections by looking at their ability to detect vol. 24, no. 3, pp. 43-52, 2002 communities on weighted graphs, ability to have overlapping [11] M. Girvan and M. E. Newman, “Community structure in social and nodes in communities, and the technique or metric applied. biological networks,” Proceedings of the national academy of This summary is presented in Table I. sciences, vol. 99, no. 12, pp. 7821–7826, 2002. [12] S. Fortunato and A. Lancichinetti, “Community detection algorithms: V. CONCLUSION a comparative analysis: invited presentation, extended abstract,” in Proceedings of the Fourth International ICST Conference on This work has briefly discussed community detection in Performance Evaluation Methodologies and Tools, 2009, p. 27. networks. It defined some of the metrics used to measure [13] W. W. Zachary, “An information flow model for conflict and fission performances in community detection. It also gave a brief in small groups,” Journal of anthropological research, pp. 452–473, survey of some community detection algorithms in literature, 1977. pointing out their strengths and limitations. For example, the [14] S. Bell, A. McDiarmid, and J. Irvine, “Nodobo: Mobile phone as a result in Table I shows the discussed algorithms used for software sensor for social network research,” in Vehicular 211 International Conference on Information and Communication Technology and Its Applications (ICTA 2016) Technology Conference (VTC Spring), 2011 IEEE 73rd, 2011, pp. 1– in Proceedings of the fourth SNA-KDD Workshop, KDD 2010, July, 5. 2010, vol. 25, pp. 1–9. [15] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time [18] S. Gregory, “Finding overlapping communities in networks by label algorithm to detect community structures in large-scale networks,” propagation,” New Journal of Physics, vol. 12, no. 10, p. 103018, Physical Review E, vol. 76, no. 3, Sep. 2007. Oct. 2010. [16] G.-J. Qi, C. C. Aggarwal, and T. Huang, “Community detection with [19] Z. Lu, Y. Wen, and G. Cao, “Community detection in weighted edge content in social media networks,” in 2012 IEEE 28th networks: Algorithms and applications,” in Pervasive Computing and International Conference on Data Engineering, 2012, pp. 534–545. Communications (PerCom), 2013 IEEE International Conference on, [17] M. Ovelgönne, A. Geyer-Schulz, and M. Stein, “Randomized greedy 2013, pp. 179–184. modularity optimization for group detection in huge social networks,” 212