Cliques are Too Strict for Representing Communities: Finding Large k-plexes in Real Networks Alessio Conte2 , Donatella Firmani1 , Caterina Mordente3 , Maurizio Patrignani1 , and Riccardo Torlone1 1 Roma Tre University, donatella.firmani@uniroma3.it, {patrigna,torlone}@dia.uniroma3.it 2 National Institute of Informatics, Japan, conte@nii.ac.jp 3 Be Think Solve Execute, c.mordente@be-tse.it Abstract. k-plexes are a formal yet flexible way of defining commu- nities in networks. They generalize the notion of cliques and are more appropriate in most real cases: while a node of a clique C is connected to all other nodes of C, a node of a k-plex may miss up to k connec- tions. Unfortunately, computing all maximal k-plexes is a gruesome task and state-of-the-art algorithms can only process small-size networks. In this paper we propose a new approach for enumerating large k-plexes in networks that speeds up the search by several orders of magnitude, leveraging on efficient techniques for the computation of maximal cliques. 1 Introduction In the vast majority of networks representing real-world scenarios the distribu- tion of edges is not uniform and it is often possible to clearly distinguish groups of nodes that are highly connected. The automatic detection of these groups, often called communities, helps to discover fundamental properties of large net- works in a variety of different domains. For this reason this problem has been largely investigated [14]. A clique is a set of nodes in a network with all possible edges among them, and is a formal and strict way of defining a community. So strict, in fact, that cliques are generally thought to be too rigid to be used in practice [17]. A more appropriate notion in many practical cases is the k-plex : a set of nodes such that each of them has edges with all the others, with the possible exception of up to k missing neighbors (including itself). So, for exam- ple, for k = 1, k-plexes are cliques, for k = 2, each node may miss one edge, etc. Hence, k-plexes are a simple and intuitive generalization of cliques. The problem of finding k-plexes arises in social network analysis [4], but it has wider applicability in several important areas employing graph-based data SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s). mining [19]. Unfortunately, the detection of all maximal k-plexes is unpractical being hindered by two main problems: (i) maximal k-plexes are even more nu- merous than maximal cliques, even if most k-plexes are small and not significant; (ii) the most efficient algorithms in the literature, such as [6], can only be used on small-size graphs: our experiments shows that the largest networks we were able to analyze with such algorithms have a few hundred nodes. In this paper we propose a solution to the first issue that is also a solution to the second one. Namely, if we restrict the search to large k-plexes, which are the most meaningful in practice, we can devise efficient algorithms to detect them. Indeed, computing all maximal k-plexes does not make sense when the pur- pose is that of detecting communities. In this respect, it is useful to focus on the relationship between s, the size of a k-plex, and k itself. Starting from k = 1, which corresponds to cliques, if we increase the value of k, we obtain progressively sparser communities that are clearly less interesting in practice. (For s ≤ k, a k-plex can be composed of isolated nodes, and there exist many disconnected k-plexes for s < 2k.) In this framework, our strategy for finding large k-plexes relies on two main observations. First, the complexity of the problem can be reduced in the vast majority of cases on the basis of certain properties of large k-plexes that can be efficiently checked and that allows us to filter out a large portion of the network before starting their search. The second consideration is that, differently to what happens for k-plexes, the state-of-the-art techniques to compute all maximal cliques are able to scale up to millions of nodes by de- composing the network into small blocks [12]. Unfortunately, the decomposition approach cannot be easily adapted to the detection of k-plexes. However, we demonstrate that we can find all k-plexes non-smaller than m by looking in the neighborhood of cliques of a size that depends on k and m. The knowledge of cliques in a network provides a hint for finding all the significant k-plexes. Structure of this paper. Section 2 contains an informal overview of our ap- proach. Detailed description of our algorithms and their theoretical basis can be found in [10]. Main experimental results of [10] are reported in Section 3. Finally, Sections 4 and 5 contain related work and our concluding remarks. 2 Overview Our approach is based on two main ideas: (i) before starting the search of k- plexes, we can filter out a relevant portion of the network in which necessary conditions for the presence of large k-plexes do not hold, and (ii) in large net- works, cliques can drive the search of k-plexes. While the first point provides an effective way to simplify the problem at hand, the second can lead to an efficient strategy for finding k-plexes. Let us elaborate on these ideas starting with the problem of finding all k-plexes of maximum size. Assume that we have computed all the maximal cliques of a network and let ω be the size of the maximum clique. Then, a maximum k-plex has size at least ω, since cliques are also k-plexes. For We recall that the problem of enumerating maximal k-plexes is NP-Hard. coreness=2 cliqueness=3 coreness=4 coreness=4 cliqueness=3 cliqueness=5 c c b b d d g a g a f f e e (b) (c) coreness=3 cliqueness=2 (a) Fig. 1: An example network. example, suppose we are searching for k-plexes in the network in Fig. 1a, which we will use as a running example in this section: we have that ω = 5, since the maximum clique (the blue subgraph on the left hand side) involves five nodes. Filtering. At this point, we observe that two filtering criteria can be applied. 1. coreness. Our first intuition follows from the very definition of k-plex: all the nodes of a k-plex of size m must have degree non-smaller than m − k. If we know that the size of a maximum k-plex is at least ω, this means that we can iteratively filter out any node that has degree lower than ω − k. This corresponds to computing the coreness of all the nodes of the network (Lemma 2 of [10]), a process that can be executed in linear time [5]. For example, suppose we are searching for 2-plexes in the running example of Fig. 1a for which ω = 5: we can filter out the three black nodes on the top of the picture since they have coreness 2, which is less than ω − k = 3. In larger networks, this criterion allows us to cut up to 99% of the nodes. 2. cliqueness. The second intuition is that any node of a k-plex of size m must be included in a clique of a size that depends on m. This is confirmed by Corollary 5.5 of [10] stating that any node of a k-plex larger or equal to m is included in a clique of size at least dm/ke. Then, if the size of the maximum k-plex is at least ω, we can cut out all nodes that do not belong to any clique of size at least dω/ke. For example, if we are searching for 2-plexes in the network depicted in Fig. 1a, we can filter out all nodes that do not belong to cliques of size at least d5/2e = 3, that is, the pair of black nodes in the bottom of the network. In large instances in our experiments this criterion, can be tested efficiently and is able to cut up to 98% of the nodes. Even if some nodes can be filtered out both because their low cliqueness and low coreness, the network in Fig. 1a shows that the two filtering criteria are indeed independent. When both criteria are applied, the size of the network is reduced of magnitude and standard k-plexes algorithms may become feasible. Once we have found the largest k-plexes, we may be interested into searching smaller ones. We have noted in the introduction that an exhaustive search does Graph n density ω type −1 jazz 198 1.41 · 10 30 collaboration grQc 5.241 1.05 · 10−3 44 collaboration geom 6.158 6.28 · 10−4 22 collaboration advogato 7.418 1.75 · 10−3 19 collaboration hepPh 12.006 1.64 · 10−3 239 collaboration astroPh 18.771 1.12 · 10−3 57 collaboration newm 22.015 2.42 · 10−4 3 collaboration mathSci 391.529 1.14 · 10−5 25 collaboration dblp 511.163 1.43 · 10−5 115 collaboration patents 3.8 M 2.32 · 10−6 11 citation Table 1: Real-world networks in our experiments. Networks are sorted by size n. not make much sense, since very small k-plexes are not significant, to the point that they may be even disconnected or composed by a set of isolated nodes. Hence, the second problem we tackle is to find all maximal k-plexes in the network of size bigger than a threshold m. Local search. As mentioned above, our idea is to start from cliques, which are k-plexes but not necessarily maximal, and possibly enlarge them to find maximal k-plexes. Building on the cliqueness criterion, which ensures that each node of a k-plex C of size s is included into a clique of size at least ds/ke, we start from each of such cliques K. If we set m ≥ k 2 , we have that |K| ≥ ds/ke > k, which implies that any other node of C must be adjacent to at least one node of K (in other words, K is a dominating set of C). Hence, we can search for C restricting to a block including K and all its adjacent nodes. For example, suppose you are searching for all maximal 2-plexes of size at least 5 in the network in Fig. 1a. Consider any clique of size at least ds/ke = d5/2e = 3, for example the clique K = {a, b, e} (yellow triangle in Fig. 1b). The nodes of any k-plex of size at least 5 containing K are adjacent to K (surrounded nodes of Fig. 1b). We further reduce the size of the block by proving that C can be obtained by considering only nodes belonging to K and to other cliques of size at least ds/ke intersecting with K (Lemma 5 of [10]). For example, the rightmost 2-plex of size 6 of Fig. 1a is all contained into the clique {a, b, e} and three other cliques of size 3 intersecting with K (surrounded nodes of Fig. 1c). This gives rise to an efficient searching algorithm that decomposes the network into blocks each composed of one clique as the core, and all intersecting cliques as the boundary. Each block can be separately processed, possibly in a distributed environment. 3 Experiments In this section, we compare our and previous algorithms over different real-world networks, and show the advantages and limitations of our approach. We use the algorithm in [12] for enumerating all the maximal cliques of the input graph G, and the algorithm in [6] for enumerating all the maximal k-plexes of the targeted graph block. Our code is publicly available [1]. We considered a mix of k=2 k=3 k=4 n k=2 k=3 k=4 n 1⋅107 1⋅107 1⋅106 1⋅106 #nodes in G' #nodes in H 1⋅105 1⋅105 1⋅104 1⋅104 1⋅103 1⋅103 1⋅102 1⋅102 1⋅101 jazz 1⋅101 grQc geom hepPh newm dblp grQc dblp patents jazz geom hepPh newm patents advogato astroPh mathSci advogato astroPh mathSci (a) (b) Fig. 2: (a) Number of nodes (log scale) in the sub-graph H, setting m = ω. (b) Number of nodes (log scale) surviving the coreness criterion. real-world networks with various sizes and characteristics. All our networks are publicly available on the LASAGNE meta-repository [2], and come from different human activities. The networks considered in our experiments, together with their number of nodes (n), density, maximum clique size (ω), and their type, are listed in Table 1 We compare the execution of our methods for enumerating large k-plexes, with the most recent method for enumerating all k-plexes of the input graph [6]. All our executions have a reasonable 6 hours timeout, after which they are interrupted. Filtering. In Figure 2a, we show the number of nodes in the residual sub-graph H, resulting from the application of coreness and cliqueness criteria. We use different values of k and m = ω, i.e, the maximum clique size. As frame of comparison, we show the number of nodes in G (i.e., n). The figure shows that, except for newm where ω = 3, the sub-graph H is order of magnitudes smaller than G. Notably, for different networks (jazz, geom, hephPh, newm, and dblp), H is left with only the nodes of the maximum k-plex. In such lucky cases we can even skip the execution of the enumeration step. Since the sub-graph produced by a given k is included in the sub-graph produces by k + 1, is not surprising that higher values of k yield more nodes in H. (Remember that for k = n we have G = H.) However, for most instances, the sub-graph is small with respect to G (thus allowing for faster enumeration of k-plexes) for different values of k. Finally, our criteria have little impact on the newm network, because its maximum clique size is only 3. Figure 2b reports the number of nodes of G residual after the coreness criterion, applied separately. For the considered networks, most nodes are filtered out by coreness. Then, the structures that are too connected to be filtered by coreness but too small to play a role in the search for k-plexes non-smaller than ω, are filtered out by cliqueness. Such an additional cliqueness step has bigger impact in advogato and patents. Maximum k-plexes. In Table 2, we show running times for different steps of our algorithm max plexes(G, k) for finding the maximum k-plex, compared to the time required for enumerating all k-plexes (column “full enum”), over the same input graph. For this experiment, we set k = 2 and report in column full our approach graph enum core clique enum jazz 2 h 0,008 s 0,091 s no need grQc > 6h 0,001 s 0,026 s 1,99 s geom > 6h 0,001 s 0,086 s no need advogato > 6h 0,003 s 0,234 s 1 h 45 m hepPh > 6h 0,001 s 0,421 s no need astroPh > 6h 0,046 s 0,892 s 2,92 s newm > 6h 0,22 s 0,167 s > 6h mathSci > 6h 0,009 s 2,535 s 0,11 s dblp > 6h 0,005 s 4,336 s no need patents > 6h 1,148 s 63,258 s > 6h Table 2: Running time for finding all the largest 2-plexes. – core: running time of coreness criterion, and computing a sub-graph G0 ; – clique: running time of cliqueness criterion over G0 , and computing H; – enum: the execution time of the enumeration step over H. The time for computing our criteria has little impact on the overall running time, which is dominated by the enumeration of 2-plexes of the residual sub- graph when necessary. As a consequence, in all the networks where H is left with only the nodes of the maximum clique, which is in turn also the maximum 2-plex, the computation of our algorithm ends successfully after fractions of seconds. For such networks, we write “no need” in the “enum” column. As frame of comparison, full enumeration of 2-plexes (i.e., as in [6]) requires hours even on our smallest network (jazz), and times out on the other networks. Whenever the enumeration step is necessary, its running time ranges from fractions of seconds (most networks) to 2 h, proportionally to the size of H (see Figure 2a for comparison). The only two instances that require more than 6 h are indeed nemw and patents, that correspond to the top two largest filtered sub-graphs. We observed that the results for higher values of k, namely k = 3 and k = 4, are similar. This is because the sub-graph H produced with higher valued of k contains only few more nodes than the sub-graph produced with k = 2. Large k-plexes. In Table 3, we show the overall running time of our algo- rithm large plexes(G, k, m) for finding all k-plexes non-smaller than m on different networks in our dataset. For this experiment, we use different values of k and set m = 0.8ωk , where ωk is the maximum k-plex size, as computed by max plexes(G, k). The time required for enumerating all k-plexes (column “full enum”) of such networks is always larger than our timeout (6 hours). The table also show the number of k-plexes returned (column “found”). All the net- works considered contain less than a dozen k-plex non-smaller than 0.8ωk , which are quickly found by our algorithm in most cases. Note that in this experiment coreness and cliqueness are applied with m = 0.8ωk , which is possibly dif- ferent from the threshold than in Figure 2a (i.e., when 0.8ωk ≥ ω). full our approach graph k enum time #found 2 > 6h 4,65 s 3 grQc 4 > 6h 2,7 s 1 2 > 6h 5 h 44 m 10 astroPh 4 > 6h > 6h - 2 > 6h 2,75 s 7 mathSci 4 > 6h > 6h 7 Table 3: Time for finding all k-plexes larger than 80% of the maximum clique. 4 Related Works In the field of network analysis, dense substructures in graphs (aka dense sub- graphs) are associated with communities, or more in general sets of closely re- lated elements [14, 17]. The problem of finding these substructures has been extensively studied for decades, and continues to be the object of cutting edge research. The simplest and most rigorous definition of dense subgraph is the clique, i.e., a subgraph in which all nodes are pairwise connected. Many algo- rithms for finding all maximal cliques have been developed, most of them being inspired to the Bron-Kerbosh algorithm [7], such as [13, 18] or to the more re- cent paradigm of reverse search [3], such as [15, 9, 11]. McClosky [16] performs a thorough study to devise exact algorithms for finding the largest k-plex, and heuristics for finding lower upper bounds on its size, exploiting co-k-plexes (i.e., k-plexes on the complement graph) and graph coloring techniques. The usabil- ity of the algrorithms for finding the largest k-plex is however limited to small networks, as the running time exceeds the hour for graphs with hundreds of nodes. Cohen et al. [8] give a generic framework for enumerating all maximal subgraphs with respect to hereditary and connected hereditary graph properties, i.e., properties that are closed with respect to induced subgraphs and connected induced subgraphs, respectively. Berlowitz et al. [6] apply the framework in [8], together with insights on the k-plex problem, to produce efficient algorithms for the enumeration of maximal k-plexes and maximal connected k-plexes, which are respectively hereditary and connected hereditary. Other quasi clique mod- els include the one defined by Zhai et al. [19], that is a k-plex with additional connectivity constraint, and more that can be found in this survey [17]. 5 Conclusions We have proposed a novel approach to the enumeration of large k-plexes, a formal and meaningful way to define interesting communities in real-world networks that generalizes the notion of clique. In the future, we intend to further extend the applicability of our approach and tackle the problem of computing large k-plexes on real world networks with millions of nodes. Our future work also includes experimenting with a variety of networks coming from different domains, such as (but not limited to) biological networks, web graphs, and product co- purchasing networks. Bibliography [1] http://patrignani.dia.uniroma3.it/large-k-plexes. [Online; ac- cessed Feb-2017]. [2] http://lasagne-unifi.sourceforge.net. [Online; accessed Feb-2017]. [3] D. Avis and K. Fukuda. Reverse search for enumeration. Discrete Applied Mathematics, 65(1-3):21 – 46, 1996. [4] B. Balasundaram, S. Butenko, and I. V. Hicks. Clique relaxations in social network analysis: The maximum k-plex problem. Oper. Res., 59(1):133–142, Jan. 2011. [5] V. Batagelj and M. Zaversnik. An o(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2003. [6] D. Berlowitz, S. Cohen, and B. Kimelfeld. Efficient enumeration of maximal k-plexes. In SIGMOD, SIGMOD ’15, pages 431–444. ACM, 2015. [7] C. Bron and J. Kerbosch. Finding all cliques of an undirected graph (algo- rithm 457). Commun. ACM, 16(9):575–576, 1973. [8] S. Cohen, B. Kimelfeld, and Y. Sagiv. Generating all maximal induced sub- graphs for hereditary and connected-hereditary graph properties. Journal of Computer and System Sciences, 74(7):1147 – 1159, 2008. [9] C. Comin and R. Rizzi. An improved upper bound on maximal clique listing via rectangular fast matrix multiplication. CoRR, abs/1506.01082, 2015. [10] A. Conte, D. Firmani, C. Mordente, M. Patrignani, and R. Torlone. Fast enumeration of large k-plexes. In KDD, pages 115–124. ACM, 2017. [11] A. Conte, R. Grossi, A. Marino, and L. Versari. Sublinear-space bounded- delay enumeration for massive network analytics: Maximal cliques. In ICALP, pages 148:1–148:15, 2016. [12] A. Conte, R. D. Virgilio, A. Maccioni, M. Patrignani, and R. Torlone. Find- ing all maximal cliques in very large social networks. In EDBT, pages 173– 184, 2016. [13] D. Eppstein and D. Strash. Listing all maximal cliques in large sparse real-world graphs. In SEA, pages 364–375, 2011. [14] S. Fortunato. Community detection in graphs. Physics reports, 486(3):75– 174, 2010. [15] K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In SWAT, pages 260–272, 2004. [16] B. McClosky and I. V. Hicks. Combinatorial algorithms for the maximum k-plex problem. J. Comb. Optim., 23(1):29–49, 2012. [17] J. Pattillo, N. Youssef, and S. Butenko. Clique Relaxation Models in Social Network Analysis, pages 143–162. Springer New York, New York, NY, 2012. [18] E. Tomita, A. Tanaka, and H. Takahashi. The worst-case time complexity for generating all maximal cliques and computational experiments. Theor. Comput. Sci., 363(1):28–42, 2006. [19] H. Zhai, M. Haraguchi, Y. Okubo, and E. Tomita. A fast and complete algorithm for enumerating pseudo-cliques in large graphs. International Journal of Data Science and Analytics, 2(3-4):145–158, 2016.