
Semi-Supervised Overlapping Community Finding with Pairwise Constraints

Elham Alghamdi

elham.alghamdi@ucdconnect.ie

Derek Greene

derek.greene@ucd.ie

School of Computer Science & Informatics, University College Dublin

In complex networks, we say that a network has community structure if subsets of its nodes form dense, highly-connected groups. Algorithms for detecting communities are generally unsupervised, relying solely on the network topology. However, such algorithms can often fail to uncover structure that reflects the underlying communities in the data, particularly when those communities are highly overlapping. One of the ways to improve accuracy is by harnessing additional background information (e.g. from domain experts), which can be used as a source of constraints to guide the community detection process. In this work, we explore the potential of semi-supervised strategies to improve algorithms for finding overlapping communities in networks. Specifically, we propose a greedy approach for finding communities using a limited number of pairwise constraints.

1 Introduction

Many applications of machine learning do not neatly correspond to the standard distinction between supervised and unsupervised learning [6]. In many domains, a limited degree of background knowledge will be available. Often supervision will take the form of pairwise constraints, which describe the relations between pairs of data objects. Such constraints have been used to guide and improve the usefulness of clustering algorithms [4]. The idea of semi-supervised learning also extends to the area of network analysis. Tasks such as community detection can potentially benefit from the introduction of limited supervision originating from individual domain experts or crowdsourcing, where this knowledge might be encoded as constraints indicating that a pair of nodes should always be assigned to the same community or should never be assigned to the same community. By harnessing this knowledge, we can potentially uncover communities of nodes which are difficult to identify when analyzing large or noisy networks.

Initial work in community detection focused on the development of algorithms which produce disjoint groups, where each node belongs to only one community [5]. However, in many real-world networks, nodes will naturally belong to multiple communities. In the case of both online and offline social networks, we can observe pervasive overlap where individuals belong to many highly-overlapping social groups [2]. More recently, overlapping community finding algorithms have been developed for application to these real networks [2, 16]. In contrast, work on semi-supervised community finding continues to focus on cases where communities are required to be disjoint.

In this paper, we propose a semi-supervised approach for community finding, based on the concept of greedy clique expansion [16], which we refer to as Pairwise Constrained GCE (PC-GCE). We introduce must-link and cannot-link constraints to both the initialization phase of the process and to the subsequent community expansion process. Experimental evaluations on benchmark synthetic networks demonstrate that the introduction of a relatively small number of constraints can improve our ability to correctly uncover the underlying communities in these networks.

The remainder of this paper is structured as follows: Section 2 provides a summary of relevant work pertaining to semi-supervised learning and community finding. In Section 3 we describe the proposed approach for community finding. To demonstrate the effectiveness of the approach, in Section 4 we perform a benchmarking evaluation on several synthetic networks. Finally, Section 5 presents concluding remarks and suggestions for extending this work.

2 Related Work

2.1 Community Finding

Finding non-overlapping communities. Algorithms in this context can be broadly grouped into three types. (1) Hierarchical algorithms construct a tree of communities based on the network topology. These algorithms can be one of two types: divisive algorithms [10] or agglomerative algorithms [7]. (2) Modularity-based algorithms optimize the well-known modularity objective function to uncover communities in the network [21]. (3) Other algorithms. This category includes algorithms based on label propagation, spectral algorithms that make use of the eigenvectors of the Laplacian matrix or the standard matrix, and methods based on statistical models [9].

Finding overlapping communities. Existing algorithms in this context can be classified into four main categories. (1) Node seeds and local expansion. These algorithms detect communities by starting from a node or a group of nodes, then expanding them into a community using a quality function. OSLOM [13] is an example of such an algorithm, which uses a statistical function to evaluate nodes when expanding a seed into a community. Another example is MOSES [20], which is based on a statistical model and greedily optimizes an objective function. (2) Clique expansion. This type of method uses a group of fully-connected nodes, called a clique, as the starting point for expansion. CFinder [1] and Greedy Clique Expansion (GCE) [16] are examples of this type of algorithm. (3) Link clustering. This category of algorithms detects communities by partitioning the links rather than the nodes [3]. (4) Label propagation. This strategy classifies each node into a community based on the affinities of its neighboring nodes. An example is the COPRA algorithm [11].

2.2 Semi-Supervised Learning

Several forms of prior knowledge have been used to guide the community detection process. The most widely-used has been pairwise constraints ("must-link" or "cannot-link"), which indicate that two nodes must be in the same community or must be in different communities. Such constraints have been implemented in several algorithms, including a modularity-based method [18], a spectral analysis method [12, 23], and methods based on matrix factorization [22, 23]. Instead of constraints, other algorithms use node labels as prior knowledge to improve the process of community detection [17]. In [19], the authors propose a method that uses a semi-supervised label propagation algorithm based on node labels and negative information, where a node does not belong to a specific community.

The clear majority of semi-supervised algorithms in this area aim at detecting disjoint communities, whereas many real-world networks contain overlapping communities [1]. To the best of our knowledge, very little work has been done in the context of finding overlapping communities. In [8], a small set of nodes called seed nodes was used, whose affinities to a community are provided as prior knowledge to infer the affinities of the remaining nodes in the network. In the remainder of this paper we focus on the problem of semi-supervised community finding suitable for application to networks containing overlapping communities.

3 Methods

3.1 Greedy Clique Expansion (GCE)

The GCE [16] community finding algorithm initially finds maximal cliques as seeds, and subsequently expands these seeds into larger communities in a greedy fashion, by optimizing a local fitness function. Given a network $G$, a user-specified minimum clique size $k$, and a minimum community distance $\epsilon$, the GCE algorithm involves the following steps:

1. Find the seeds, which are all maximal cliques in $G$ with at least $k$ nodes.
2. Choose the largest unexpanded seed and greedily expand it into a candidate community $C'$ by using a community fitness function $F_S$ until adding any node would not increase the fitness value.
3. Test the distance between $C'$ and all previously accepted communities. If the distance between $C'$ and an existing community $C$ is less than $\epsilon$, then $C$ and $C'$ are deemed to be near-duplicates, so discard $C'$. Otherwise, accept $C'$.
4. Repeat steps 2 and 3 until no more seeds remain.

GCE employs the fitness function defined in [14] to expand each seed, which defines the fitness of a given community $S$ in terms of its internal and external degrees ($k_{in}$ and $k_{out}$) as follows:

$$F_S = \frac{k_{in}}{\left(k_{in} + k_{out}\right)^{\alpha}} \qquad (1)$$

where the parameter $\alpha$ typically takes values in the range $[0.9, 1.5]$. Generally, we can summarize the greedy expansion step using the fitness function $F_S$ as follows:

1. For each node $v$ in the frontier of a given seed $S$, calculate $v$'s fitness value, i.e., the amount by which the community fitness of $S$ would change if the node $v$ was added to $S$.
2. Choose the node that has the maximum fitness value, $v_{max}$.
3. If the fitness value of $v_{max}$ is positive, then insert the node $v_{max}$ into $S$ and go back to Step 1. Otherwise, terminate and return $S$.
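The greedy expansion loop above can be sketched in Python. This is a minimal illustration under stated assumptions rather than the reference GCE implementation: the network is a plain dictionary mapping each node to its neighbour set, the helper names (`build_adjacency`, `fitness`, `greedy_expand`) are hypothetical, the internal degree is assumed to count every internal edge once per endpoint, and $\alpha = 1.0$ is chosen only for the example.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def fitness(adj, community, alpha=1.0):
    """Community fitness F_S = k_in / (k_in + k_out) ** alpha (Eqn. 1)."""
    k_in = 0   # assumption: every internal edge is counted once per endpoint
    k_out = 0  # edges with exactly one endpoint inside the community
    for u in community:
        for w in adj[u]:
            if w in community:
                k_in += 1
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total > 0 else 0.0


def greedy_expand(adj, seed, alpha=1.0):
    """Grow a seed by repeatedly adding the frontier node with the largest positive gain."""
    community = set(seed)
    while True:
        frontier = {w for u in community for w in adj[u]} - community
        current = fitness(adj, community, alpha)
        best_node, best_gain = None, 0.0
        for v in sorted(frontier):  # sorted only to make tie-breaking deterministic
            gain = fitness(adj, community | {v}, alpha) - current
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:  # no node improves the fitness: stop
            return community
        community.add(best_node)


# Toy network: two 4-cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
clique_a = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
clique_b = [(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
toy_adj = build_adjacency(clique_a + clique_b + [(3, 4)])
expanded = greedy_expand(toy_adj, seed=[0, 1, 2])
```

On this toy network the seed {0, 1, 2} absorbs node 3 and then stops, because adding node 4 would lower the fitness from 12/13 to 14/17.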

To discard near-duplicate communities in Step 3 of the GCE algorithm, the authors proposed the use of an overlap-based measure of distance between two communities:

$$\delta_E(S, S') = 1 - \frac{|S \cap S'|}{\min(|S|, |S'|)} \qquad (2)$$

Given two communities $S$ and $S'$, Eqn. 2 measures the proportion of nodes in the smaller community that are not included in the larger one. So, for a given community $S$, its near-duplicates are all communities that lie within a distance $\epsilon$ of $S$.
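As a small illustration of Eqn. 2, the distance and the near-duplicate test might be written as follows. The function names and the threshold value used in the example are hypothetical, and communities are assumed to be non-empty collections of node identifiers.

```python
def community_distance(s, s_prime):
    """Overlap-based distance of Eqn. 2: 1 - |S & S'| / min(|S|, |S'|)."""
    s, s_prime = set(s), set(s_prime)
    smaller = min(len(s), len(s_prime))
    if smaller == 0:
        raise ValueError("communities must be non-empty")
    return 1.0 - len(s & s_prime) / smaller


def is_near_duplicate(candidate, accepted, epsilon):
    """True if the candidate lies within distance epsilon of any accepted community."""
    return any(community_distance(candidate, c) < epsilon for c in accepted)


# Example with an illustrative threshold (not a value taken from the paper).
example_epsilon = 0.25
contained = community_distance({1, 2}, {1, 2, 3})         # smaller set fully contained
partial = community_distance({1, 2, 3, 4}, {3, 4, 5})     # 2 of 3 nodes shared
```

A candidate that is entirely contained in an accepted community has distance 0 and is therefore always discarded, whatever positive threshold is used.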

3.2 Pairwise Constraints in Community Detection

Given a network that contains a set of nodes $V$, semi-supervised pairwise constraints typically take two possible forms:

1. Must-link constraints specify that two nodes must be in the same community. Let $C_{ML}$ be the must-link constraint set: $\forall\, v_i, v_j \in V$ where $i \neq j$, $(v_i, v_j) \in C_{ML}$ indicates that the two nodes $v_i$ and $v_j$ must be assigned to the same community.
2. Cannot-link constraints specify that two nodes must be in different communities. Let $C_{CL}$ be the cannot-link constraint set: $\forall\, v_i, v_j \in V$ where $i \neq j$, $(v_i, v_j) \in C_{CL}$ indicates that the two nodes $v_i$ and $v_j$ must be assigned to two distinct communities.

The simplest approach to selecting this form of constraints is to naively select a pair of nodes $(v_i, v_j)$ at random, and query the oracle about whether the pair should share a must-link or cannot-link constraint. This process can be repeated to select the required number of constraints or until some supervision budget is exhausted.
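The random selection procedure can be sketched as below. The oracle here is an assumption made for illustration: it answers from known ground-truth memberships and returns a must-link when two nodes share at least one community. The names `make_oracle` and `select_random_constraints`, the `"ML"`/`"CL"` labels, and the toy memberships are all hypothetical.

```python
import random


def make_oracle(memberships):
    """Assumed oracle: must-link ("ML") if two nodes share a community, else cannot-link ("CL")."""
    def oracle(u, v):
        return "ML" if memberships[u] & memberships[v] else "CL"
    return oracle


def select_random_constraints(nodes, oracle, budget, seed=0):
    """Query the oracle on distinct random node pairs until the budget is exhausted."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    max_pairs = len(nodes) * (len(nodes) - 1) // 2
    budget = min(budget, max_pairs)  # cannot ask about more pairs than exist
    must_link, cannot_link = set(), set()
    while len(must_link) + len(cannot_link) < budget:
        u, v = rng.sample(nodes, 2)
        pair = tuple(sorted((u, v)))
        if pair in must_link or pair in cannot_link:
            continue  # already queried
        if oracle(u, v) == "ML":
            must_link.add(pair)
        else:
            cannot_link.add(pair)
    return must_link, cannot_link


# Toy ground truth: node 2 overlaps communities "a" and "b".
toy_memberships = {0: {"a"}, 1: {"a"}, 2: {"a", "b"}, 3: {"b"}, 4: {"b"}}
toy_oracle = make_oracle(toy_memberships)
all_ml, all_cl = select_random_constraints(range(5), toy_oracle, budget=10)
```

With the budget set to all ten possible pairs, the toy example yields six must-link pairs and four cannot-link pairs, independent of the random seed.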

In non-overlapping community finding, must-link constraints have what is referred to as a transitive property, where a third must-link relationship can be inferred from two other associated must-link constraint pairs. So, if $(v_i, v_j) \in C_{ML}$ and $(v_i, v_k) \in C_{ML}$, then we can also infer that $(v_j, v_k) \in C_{ML}$ (see the first example in Fig. 1).

However, incorporating constraints into the context of overlapping communities is more challenging. This is because the transitive property does not hold here (see the second example in Fig. 1). Specifically, if $(v_i, v_j) \in C_{ML}$ and $(v_i, v_k) \in C_{ML}$, there are two possible scenarios for the pair $(v_j, v_k)$. It can be the case that either $(v_j, v_k) \in C_{ML}$ or $(v_j, v_k) \in C_{CL}$. This is because an overlapping node $v_i$ can have a must-link constraint with both $v_j$ and $v_k$, yet these two nodes could belong to two different communities. However, it is also possible that all three nodes are in fact in the same community. Unless we explicitly inform the algorithm about whether a must-link or cannot-link constraint exists for the pair $(v_j, v_k)$, we cannot reliably distinguish between the two cases. If the network has highly-overlapping communities, then this problematic situation will occur more frequently. If we naively attempt to incorporate pairwise constraints into overlapping community finding without taking this situation into account, the quality of the resulting communities may decrease even as more constraints are added.

Fig. 1: In the non-overlapping context, the transitive property allows us to infer a third must-link constraint from two existing must-link constraints. However, this does not automatically apply in the overlapping context, where two possible situations exist. (Panels: non-overlapping case; overlapping case.)

[Figure panels: Must-link Set, Cannot-link Set]

To resolve this issue, after selecting pairwise constraints, we need to explicitly detect every cannot-link pair that is derived from any two connected must-link pairs and insert it into the cannot-link set. However, this will significantly increase the size of the cannot-link set. For example, if we want to supply only five pairwise constraints of each type, as shown in Fig. 2, inserting the required cannot-link pairs will double the size of the cannot-link set. In the next section, we introduce an approach to address this issue.

3.3 Semi-Supervised GCE

We now describe our proposed approach for overlapping community finding with limited supervision, referred to as Pairwise Constrained GCE (PC-GCE). This approach consists of two stages. The first stage includes selecting and pre-processing constraints, and resolves the problem of the lack of the transitive property for must-link constraints in the context of overlapping communities. The second stage supplies the resulting constraints to the GCE algorithm to process them during community detection. In the following, these stages are described in more detail.

Stage 1: Selecting and pre-processing constraints. In this stage, we can treat the set of pairwise constraints as a new graph, where an edge exists between two nodes from the original network if they share a pairwise constraint (either must-link or cannot-link). Then we look for all possible forbidden triads among the nodes involved in the must-link set. Given three nodes A, B, C, a forbidden triad (or open triad) occurs when A is connected to B and C, but no edge exists between B and C. In our pre-processing step, we look for such cases, i.e., where we do not know whether a must-link or cannot-link exists between a pair of nodes B and C. To control the size of the constraint set, we greedily expand it until we reach a pre-defined maximum size. This stage can be summarized as follows (see also Fig. 3):

1. Choose a small initial random set of must-link and cannot-link constraints.
2. Find all forbidden triads in the must-link set and query the oracle about the relationship of each unconnected pair.
3. If the relationship is must-link, insert the pair into the must-link set; otherwise insert it into the cannot-link set.
4. Repeat all steps until the constraint set reaches the maximum size.

At the end of this stage, the pairwise constraints set is ready to be supplied to the GCE algorithm for community detection.
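Steps 2 and 3 of Stage 1 can be sketched as follows. This is a simplified illustration, not the authors' code: it only closes forbidden triads among existing must-link pairs and omits the random top-up of Step 4. The function names, the `"ML"`/`"CL"` oracle labels, and the toy memberships are hypothetical, and the oracle is assumed to answer must-link whenever two nodes share a ground-truth community.

```python
from itertools import combinations


def find_forbidden_triads(must_link, cannot_link):
    """Return pairs (b, c) where a-b and a-c are must-links but b-c has no constraint yet."""
    known = {frozenset(p) for p in must_link} | {frozenset(p) for p in cannot_link}
    neighbours = {}
    for u, v in must_link:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    open_pairs = set()
    for linked in neighbours.values():
        for b, c in combinations(sorted(linked), 2):
            if frozenset((b, c)) not in known:
                open_pairs.add((b, c))
    return open_pairs


def close_triads(must_link, cannot_link, oracle, max_size):
    """Query the oracle for every open pair until none remain or max_size is reached."""
    must_link, cannot_link = set(must_link), set(cannot_link)
    while len(must_link) + len(cannot_link) < max_size:
        open_pairs = find_forbidden_triads(must_link, cannot_link)
        if not open_pairs:
            break
        for b, c in sorted(open_pairs):
            if len(must_link) + len(cannot_link) >= max_size:
                break
            target = must_link if oracle(b, c) == "ML" else cannot_link
            target.add((b, c))
    return must_link, cannot_link


# Toy ground truth (hypothetical): node 2 belongs to both communities.
toy_memberships = {0: {"a"}, 1: {"a"}, 2: {"a", "b"}, 3: {"b"}}


def toy_oracle(u, v):
    return "ML" if toy_memberships[u] & toy_memberships[v] else "CL"


initial_ml = {(0, 1), (0, 2), (2, 3)}
final_ml, final_cl = close_triads(initial_ml, set(), toy_oracle, max_size=10)
```

In the toy example the overlapping node 2 is must-linked to nodes 0 and 3, so the open pair (0, 3) is queried and ends up in the cannot-link set, which is exactly the case that the transitive property would have handled incorrectly.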

Stage 2: Pairwise Constrained GCE (PC-GCE). During the community detection phase, we incorporate only cannot-link constraints into the existing GCE algorithm as follows (see Fig. 4 for an illustration):

1. Find seeds, which are all maximal cliques in $G$ with at least $k$ nodes.
2. Choose the largest unexpanded seed and greedily expand it to a candidate community $C'$ by using a community fitness function (Eqn. 1) until adding any node no longer increases the fitness value. However, during this expansion process, do not add any node that has a cannot-link relationship with any existing node in the seed.
3. Check for the existence of any cannot-link constraints among all pairs of nodes in $C'$. If such a pair exists, calculate the fitness for both nodes relative to $C'$, and remove the one with the lower value.
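Steps 2 and 3 of Stage 2 can be illustrated with the sketch below. It is not the authors' implementation: the adjacency-dictionary representation and all function names are hypothetical, and the fitness of a node relative to $C'$ is assumed to be the community fitness with the node minus the community fitness without it.

```python
def fitness(adj, community, alpha=1.0):
    """Community fitness k_in / (k_in + k_out) ** alpha, as in Eqn. 1."""
    k_in = sum(1 for u in community for w in adj[u] if w in community)
    k_out = sum(1 for u in community for w in adj[u] if w not in community)
    total = k_in + k_out
    return k_in / total ** alpha if total > 0 else 0.0


def node_fitness(adj, community, v, alpha=1.0):
    """Assumed definition: fitness of the community with v minus fitness without v."""
    return fitness(adj, community, alpha) - fitness(adj, community - {v}, alpha)


def constrained_expand(adj, seed, cannot_link, alpha=1.0):
    """Expand a seed while honouring cannot-link pairs (Steps 2 and 3 above)."""
    cl = {frozenset(pair) for pair in cannot_link}
    community = set(seed)
    # Step 2: greedy expansion, skipping nodes that conflict with a current member.
    while True:
        frontier = {w for u in community for w in adj[u]} - community
        current = fitness(adj, community, alpha)
        best_node, best_gain = None, 0.0
        for v in sorted(frontier):
            if any(frozenset((v, u)) in cl for u in community):
                continue
            gain = fitness(adj, community | {v}, alpha) - current
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:
            break
        community.add(best_node)
    # Step 3: for each cannot-link pair still inside, drop the node with lower fitness.
    for u, v in sorted(tuple(sorted(pair)) for pair in cl):
        if u in community and v in community:
            fu = node_fitness(adj, community, u, alpha)
            fv = node_fitness(adj, community, v, alpha)
            community.remove(u if fu < fv else v)
    return community


# Toy network: two 4-cliques {0,1,2,3} and {4,5,6,7} joined by the edge (3, 4).
toy_adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
unconstrained = constrained_expand(toy_adj, [0, 1, 2], cannot_link=set())
constrained = constrained_expand(toy_adj, [0, 1, 2], cannot_link={(0, 3)})
```

In this sketch Step 3 only has an effect when the seed clique itself contains a cannot-link pair, since Step 2 never admits a conflicting node during expansion.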

The justification for using only cannot-link constraints in the community detection process is as follows. Because of the greedy nature of the seed expansion step, incorporating must-link constraints into this step results in a smaller number of considerably larger communities. Using the fitness function as a greedy local optimization technique to expand a clique into a community already satisfies some of the must-link relationships, and processing the cannot-link set will mostly detect the pairs that are derived from two connected must-link pairs. Thus, explicitly processing the must-link set would amount to processing additional non-informative constraints, which introduce noise into the algorithm and reduce its accuracy.

4 Evaluation

In this section, the performance of the Pairwise Constrained GCE algorithm (PC-GCE) is evaluated by running experiments on two groups of synthetic benchmark networks containing overlapping communities. Since, to the best of our knowledge, no work has been conducted in the literature regarding pairwise constrained algorithms for finding overlapping communities, for the sake of comparison, the PC-GCE results are compared with the following unsupervised overlapping community detection algorithms: standard GCE [16], OSLOM [13], MOSES [20], and COPRA [11].

4.1 Data

The synthetic networks used in our experiments are generated using the widely-used LFR benchmark generator [15], which can produce networks with properties similar to real-world networks, which also contain embedded ground truth communities. The full set of parameter values used to generate the networks is listed in Table 1.

Fig. 3: An illustration of the four steps involved in Stage 1 of semi-supervised GCE. Step 1: Generate small initial random set of constraints. Step 2: Find all instances of forbidden triads among the must-link constraints. Step 3: Query the oracle for constraints for the missing pairs. Step 4: Generate another random set of constraints.

Fig. 4: An illustration of the four steps involved in Stage 2 of semi-supervised GCE. Step 1: Detect all maximal cliques in the network containing at least k nodes. Step 2: Expand each seed, skipping any node that has a cannot-link with any existing node in the current seed. Step 3: Process the cannot-link set for each resulting community C'. Step 4: If C' overlaps with any previously accepted community, then discard it. Otherwise, accept C'.

In our experiments, we generate two different groups of networks, containing small and large communities respectively. Small communities have 10-50 nodes, while large communities have 20-100 nodes. Each group consists of 16 networks with different combinations of the two parameters $O_m$ and $O_n$. The parameter $O_m$ controls the number of communities per node, and $O_n$ controls the number of overlapping nodes. For the first network in each set, all nodes belong to two communities ($O_m = 2$), then for each successive network this parameter increments in value by 1 until $O_m = 5$ is reached. For each value taken by the parameter $O_m$, we increase the fraction of overlapping nodes $O_n$ by 25% until 100% of the nodes belong to more than one community.

To compare the performance of the different algorithms in our experiments, we use the overlapping form of the standard Normalized Mutual Information (NMI) measure, as proposed in [14]. This measures the level of agreement between the communities produced by an algorithm on a network and the ground truth communities in that network. A value close to 1 indicates a high level of agreement, while a value close to 0 indicates that the algorithm's communities are no better than random.
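For reference, the 16 parameter combinations per group described above can be enumerated with a few lines of Python. The dictionary keys and constant names are hypothetical, and $O_n$ is expressed here as a fraction of nodes; the remaining generator parameters from Table 1 are not reproduced.

```python
from itertools import product

OM_VALUES = [2, 3, 4, 5]                  # communities per overlapping node
ON_FRACTIONS = [0.25, 0.50, 0.75, 1.00]   # fraction of nodes that overlap
COMMUNITY_SIZE_RANGES = {"small": (10, 50), "large": (20, 100)}


def benchmark_grid():
    """List one configuration per (group, O_m, O_n) combination."""
    grid = []
    for group, (min_size, max_size) in COMMUNITY_SIZE_RANGES.items():
        for om, on_fraction in product(OM_VALUES, ON_FRACTIONS):
            grid.append({
                "group": group,
                "min_community_size": min_size,
                "max_community_size": max_size,
                "om": om,
                "on_fraction": on_fraction,
            })
    return grid


configurations = benchmark_grid()
```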

We have conducted two experiments. The first experiment aims to measure the current performance of the selected community detection algorithms. We use these values as a baseline for evaluating the performance of the proposed PC-GCE algorithm. The second experiment evaluates the performance of PC-GCE on differing numbers of constraints, ranging from 1% to 5% of the total number of possible constraint pairs in each network. In this experiment, the initial pairwise constraints are selected at random. Therefore, we repeat the process over 20 runs and average the NMI scores. Finally, we compare the results obtained from PC-GCE with the selected benchmark algorithms.

We have two baselines for comparison and evaluation of the results. Firstly, we compare the accuracy of PC-GCE to standard GCE (Table 2). Secondly, we compare the accuracy of the PC-GCE algorithm to the other benchmark algorithms (Table 3).

As we observe from Table 2, in most cases, regardless of the fraction of overlapping nodes in the networks, PC-GCE outperforms the standard GCE algorithm. To be more specific, for all values of the fraction of overlapping nodes, as the percentage of pairwise constraints increases, the accuracy of PC-GCE also improves. On the other hand, as we increase the fraction of overlapping nodes $O_n$ from 25% to 100% of the total number of nodes, the NMI of GCE drops to 0 for almost 60% of the total set of results. For instance, in the case of small community networks, the NMI score of GCE drops from 0.764 to 0 for $O_m = 3$, and from 0.648 to 0 for $O_m = 4$. In contrast, the PC-GCE algorithm shows a moderate decrease in accuracy as the value of $O_n$ increases. This indicates that PC-GCE outperforms the standard GCE on networks with highly-overlapping communities. However, in general, both algorithms show better performance on the networks containing smaller communities.

We can see from Table 3 that the PC-GCE algorithm outperforms the COPRA algorithm on both types of networks. When considering the smallest fraction of overlapping nodes (i.e., $O_n = 250$), COPRA performs almost comparably with PC-GCE. However, as the value of $O_n$ increases, the performance of COPRA drops to 0, whereas the performance of PC-GCE shows only a slight decrease in accuracy. Thus, it can be concluded that PC-GCE outperforms COPRA on highly overlapping community networks. On the other hand, PC-GCE outperforms MOSES on large-community networks but shows similar performance on small-community networks. Finally, OSLOM exhibits NMI scores that are comparable overall to those of PC-GCE on both types of networks.

5 Conclusion

In this paper, we have explored the potential of semi-supervised strategies to improve existing algorithms for finding overlapping communities in networks, and have proposed a new algorithm for detecting overlapping communities with pairwise constraints (PC-GCE). Extensive experiments were carried out, and the results show that the GCE algorithm with constraints (PC-GCE) outperforms unconstrained algorithms on networks with highly-overlapping communities, and that its performance improves with an increasing number of pairwise constraints. This shows the potential of using semi-supervised strategies for finding overlapping communities. However, in most networks, only a small percentage of the nodes have informative pairwise constraints. Thus, random selection of pairwise constraints could have adverse effects, decreasing the accuracy of the community detection process through the selection of non-informative nodes. Therefore, our future work will aim to apply ideas from active learning for selecting informative pairwise constraints. We will also explore the effect of incorrect "noisy" constraints and how they affect algorithm performance.

Acknowledgements. This publication has partly emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

References

1. Adamcsek, B., Palla, G., Farkas, I.J., Derényi, I., Vicsek, T.: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8), 1021-1023 (2006)

2. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multiscale complexity in networks. Nature 466(7307), 761-764 (2010)

3. Amelio, A., Pizzuti, C.: Overlapping community discovery methods: a survey. In: Social Networks: Analysis and Case Studies, pp. 105-125. Springer (2014)

4. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 59-68 (2004)

5. Blondel, V., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. P10008 (2008)

6. Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge (2006)

7. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large networks. Physical Review E 70(6), 066111 (2004)

8. Dreier, J., Kuinke, P., Przybylski, R., Reidl, F., Rossmanith, P., Sikdar, S.: Overlapping communities in social networks. arXiv preprint arXiv:1412.4973 (2014)

9. Fortunato, S.: Community detection in graphs. Physics Reports 486(3), 75-174 (2010)

10. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821-7826 (2002)

11. Gregory, S.: Finding overlapping communities in networks by label propagation. New Journal of Physics 12(10), 103018 (2010)

12. Habashi, S., Ghanem, N.M., Ismail, M.A.: Enhanced community detection in social networks using active spectral clustering. In: Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 1178-1181. ACM (2016)

13. Lancichinetti, A., Radicchi, F., Ramasco, J., Fortunato, S., Ben-Jacob, E.: Finding statistically significant communities in networks. PLoS ONE 6(4), e18961 (2011)

14. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11(3), 033015 (2009)

15. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Physical Review E 78(4), 046110 (2008)

16. Lee, C., Reid, F., McDaid, A., Hurley, N.: Detecting highly overlapping community structure by greedy clique expansion. In: Workshop on Social Network Mining and Analysis (2010)

17. Leng, M., Yao, Y., Cheng, J., Lv, W., Chen, X.: Active semi-supervised community detection algorithm with label propagation. In: International Conference on Database Systems for Advanced Applications, pp. 324-338. Springer (2013)

18. Li, L., Du, M., Liu, G., Hu, X., Wu, G.: Extremal optimization-based semi-supervised algorithm with conflict pairwise constraints for community detection. In: Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pp. 180-187. IEEE (2014)

19. Liu, D., Duan, D., Sui, S., Song, G.: Effective semi-supervised community detection using negative information. Mathematical Problems in Engineering 2015 (2015)

20. McDaid, A., Hurley, N.: Detecting highly overlapping communities with model-based overlapping seed expansion. In: Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, pp. 112-119. IEEE (2010)

21. Newman, M.E.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23), 8577-8582 (2006)

22. Shi, X., Lu, H., He, Y., He, S.: Community detection in social network with pairwisely constrained symmetric non-negative matrix factorization. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pp. 541-546. ACM (2015)

23. Zhang, Z.Y.: Community structure detection in complex networks with partial background information. EPL (Europhysics Letters) 101(4), 48005 (2013)