Semi-Supervised Overlapping Community Finding with Pairwise Constraints Elham Alghamdi and Derek Greene School of Computer Science & Informatics, University College Dublin elham.alghamdi@ucdconnect.ie,derek.greene@ucd.ie Abstract. In complex networks, we say that a network has community structure if subsets of its nodes form dense, highly-connected groups. Algorithms for detecting communities are generally unsupervised, rely- ing solely on the network topology. However, such algorithms can often fail to uncover structure that reflects the underlying communities in the data, particularly when those communities are highly overlapping. One of the ways to improve accuracy is by harnessing additional background information (e.g. from domain experts), which can be used as a source of constraints to guide the community detection process. In this work, we explore the potential of semi-supervised strategies to improve algo- rithms for finding overlapping communities in networks. Specifically, we propose a greedy approach for finding communities using a limited num- ber of pairwise constraints. 1 Introduction Many applications of machine learning do not neatly correspond to the standard distinction between supervised and unsupervised learning [6]. In many domains, a limited degree of background knowledge will be available. Often supervision will take the form of pairwise constraints, which describe the relations between pairs of data objects. Such constraints have been used to guide and improve the usefulness of clustering algorithms [4]. The idea of semi-supervised learning also extends to the area of network analysis. Tasks such as community detection can potentially benefit from the introduction of limited supervision originating from individual domain experts or crowdsourcing, where this knowledge might be encoded as constraints indicating that a pair of nodes should always be assigned to the same community or should never be assigned to the same community. By harnessing this knowledge, we can potentially uncover communities of nodes which are difficult to identify when analyzing large or noisy networks. Initial work in community detection focused on the development of algo- rithms which produce disjoint groups, where each node belongs one community [5]. However, in many real-world networks, nodes will naturally belong to mul- tiple communities. In the case of both online and offline social networks, we can observe pervasive overlap where individuals belong to many highly-overlapping social groups [2]. More recently, overlapping community finding algorithms have been developed for application to these real networks [2, 16]. In contrast, work on semi-supervised community finding continues to focus on cases where com- munities are required to be disjoint. In this paper, we propose a semi-supervised approach for community find- ing, based on the concept of greedy clique expansion [16], which we refer to as Pairwise Constrained GCE (PC-GCE). We introduce must-link and cannot-link constraints to both the initialization phase of the process and to the subse- quent community expansion process. Experimental evaluations on benchmark synthetic networks demonstrate that the introduction of a relatively small num- ber of constraints can improve our ability to correctly uncover the underlying communities in these networks. The remainder of this paper is structured as follows: Section 2 provides a summary of relevant work pertaining to semi-supervised learning and commu- nity finding. In Section 3 we describe the proposed approach for community finding. To demonstrate the effectiveness of the approach, in Section 4 we per- form a benchmarking evaluation on several synthetic networks. Finally, Section 5 presents concluding remarks and suggestions for extending this work. 2 Related Work 2.1 Community Finding Finding non-overlapping communities. Algorithms in this context can be broadly grouped into three types. (1) Hierarchical algorithms construct a tree of communities based on the network topology. These algorithms can be one of two types: divisive algorithms [10] or agglomerative algorithms [7]. (2) Modularity- based algorithms optimize the well-known modularity objective function to un- cover communities in the network [21]. (3) Other algorithms. This category in- cludes algorithms based on label propagation approach, spectral algorithms that make use of the eigenvectors of Laplacian matrix or standard matrix, and meth- ods based on statistical models [9]. Finding overlapping communities. Existing algorithms in this context can be classified into four main categories. (1) Node seeds and local expansion. These algorithms detect communities by starting from a node or a group of nodes, then expanding them into a community using a quality function. OSLOM [13] is an example of such an algorithm, which uses a statistical function to evaluate the node value to expand it to a community. Another example is MOSES [20], which is an algorithm based on a statistical model and uses an objective function as a greedy optimization technique. (2) Clique expansion. This type of method uses a group of fully-connected nodes, called a clique, as the starting point for expan- sion. CFinder [1] and Greedy Clique Expansion (GCE) [16] are examples of this type of algorithm. (3) Link clustering. This category of algorithms detects com- munities by splitting the links rather than the nodes [3]. (4) Label propagation. This strategy classifies each node into a community based on its neighboring nodes affinities. An example is the COPRA algorithm [11]. 2.2 Semi-Supervised Learning Several forms of prior knowledge have been used to guide the community detec- tion process. The most widely-used has been pairwise constraints (”must-link” or ”cannot-link”), which indicate that two nodes must be in the same community or must be in different communities. Such constraints have been implemented in several algorithms, including a modularity-based method [18], a spectral analysis method [12, 23], and methods based on matrix factorization [22, 23]. Instead of constraints, other algorithms use node labels as prior knowledge to improve the process of community detection [17]. In [19], the authors propose a method that uses a semi-supervised label propagation algorithm based on node labels and negative information, where a node does not belong to a specific community. The clear majority of semi-supervised algorithms in this area aim at detecting disjoint communities, whereas many real-world networks contain overlapping communities [1]. To the best of our knowledge, very little work has been done in the context of finding overlapping communities context. In [8], a small set of nodes called seed nodes was used, whose affinities to a community is provided as prior knowledge to infer the rest of the nodes affinities in the network. In the remainder of this paper we focus on the problem of semi-supervised community finding suitable for application to networks containing overlapping communities. 3 Methods 3.1 Greedy Clique Expansion (GCE) The GCE [16] community finding algorithm initially finds maximal cliques as seeds, and subsequently expands these seeds into larger communities in a greedy fashion, by optimizing a local fitness function. Given a network G, a user- specified minimum clique size k, and a minimum community distance e, the GCE algorithm involves the following steps: 1. Find the seeds, which are all maximal cliques in G with at least k nodes. 2. Choose the largest unexpanded seed and greedily expand it into a candidate community C0 by using a community fitness function FS until adding any node would not increase the fitness value. 3. Test the distance between C0 and all previously accepted communities. If the distance between C0 and an existing community C is < e, then C and C0 are deemed to be near-duplicates, so discard C0. Otherwise, accept C0. 4. Repeat steps 2 and 3 until no more seeds remain GCE employs a fitness function defined [14] to expand each seed, which defines the fitness community of a given community S in terms of its internal and external degrees (kin and kout ) as follows S kin FS = S + k S )α (1) (kin out where the parameter α typically takes values in the range [0.9, 1.5]. Generally, we can summarize the greedy expansion step using the fitness function FS as follows: 1. For each node v in the frontier of a given seed S, calculate v’s fitness value, i.e., the amount by which the community fitness of S would change if the node v was added to S. 2. Choose the node that has the maximum fitness value, vmax . 3. If the fitness value vmax is positive, then insert the node v into S and go back to Step 1. Otherwise, terminate and return S. To discard near-duplicate communities in Step 4 of the GCE algorithm, the authors proposed the use of an overlap-based measure of distance between two communities: |S ∩ S 0 | δE (S, S 0 ) = 1 − (2) min(|S|, |S 0 |) Given two communities S and S0, Eqn. 2 measures the number of nodes in the smaller community that are not included in the larger one. So, for a given set of communities, a near-duplicate community of a given community S would be all the communities that are within a distance e of S. 3.2 Pairwise Constraints in Community Detection Given a network that contains a set of nodes V , semi-supervised pairwise con- straints typically take two possible forms: 1. Must-link constraints specify that two nodes must be in the same community. Let CM L be the must-link constraint set: ∀ vi , vj ∈ V where i 6= j, (vi , vj ) ∈ CM L indicates that the two nodes vi and vj must be assigned to the same community. 2. Cannot-link constraints specify that two nodes must be in different commu- nities. Let CCL be the cannot-link constraint set: ∀ vi , vj ∈ V where i 6= j, (vi , vj ) ∈ CCL indicates that the two nodes vi and vj must be assigned to two distinct communities. The simplest approach to selecting this form of constraints is to naively select a pair of nodes (vi , vj ) at random, and query the oracle about whether the pair should share a must-link or cannot-link constraint. This process can be repeated to select the required number of constraints or until some supervision budget is exhausted. In non-overlapping community finding, must-link constraints have what is referred to as a transitive property, where a third must-link relationship can be inferred from two other associated must-link constraint pairs. So, if (vi , vj ) ∈ CM L , and (vi , vk ) ∈ CM L , then we can also infer that (vj , vk ) ∈ CM L (see the first example in Fig. 1). However, incorporating constraints into the context of overlapping commu- nities is more challenging. This is because the transitive property does not hold %& %' %& %' %& %' !" !" !" !$ !# !$ !# !$ !# Non-overlapping Case 1 Overlapping Case Fig. 1: In the non-overlapping context, the transitive property allows us to infer a third must-link constraint from two existing must-link constraints. However, this does not automatically apply in the overlapping context, where two possible situations exist. Must-link Cannot-link Must-link Cannot-link Set Set Set Set !" !# !) !* !" !# !) !* !" !$ !+ !, !" !$ !+ !, !" !% !& !% !" !% !- !, !" !& !# !& !" !& !. !/ !' !( !$ !% !' !( !0 !1 !& !% !# !& !$ !% !$ !# Fig. 2: Example of expanding 5 pairwise constraints. Inserting the derived cannot-link pairs from all connected must-link pairs will double the size of the cannot-link set. here (see the second example in Fig. 1). Specifically, if (vi , vj ) ∈ CM L , and (vi , vk ) ∈ CM L , there are two possible scenarios for the pair (vj , vk ). It can be the case that either (vj , vk ) ∈ CM L or (vj , vk ) ∈ CCL . This is because an overlapping node vj can have a must-link constraint with both vi and vi , yet these two nodes could belong to two different communities. However, it is also possible that all three nodes are in fact in the same community. Unless we explicitly inform the algorithm about whether a must-link or cannot-link constraint exists for the pair (vj , vk ), we cannot reliably distinguish between the two cases. If the network has highly-overlapping communities, then this problematic situation will occur more frequently. If we naively attempt to incorporate pairwise constraints into over- lapping community finding without taking this situation into account, it is likely that the quality of the resulting communities can potentially decrease even as more constraints are added. To resolve this issue, after selecting pairwise constraints, we need to explicitly detect every cannot-link pair that derived from any two connect must-link pairs and insert it to the cannot link set. However, this will significantly increase the set of cannot-link constraints. For example, if we want to feed only five pairwise constraints, each as shown in Fig. 2, inserting the required cannot-link pairs will double the size of the cannot-link set. In the next section, we introduce an approach to address this issue. 3.3 Semi-Supervised GCE We now describe our proposed approach for overlapping community finding with limited supervision, referred to as Pairwise Constrained GCE (PC-GCE). This approach consists of two stages: The first stage includes selecting and pre- processing constraints and resolves the problem of the lack of the transitive property for must-link constraints in the context of overlapping communities. The second stage supplies the resulting constraints to the GCE algorithm to process them during community detection. In the following, these stages are de- scribed in more detail. Stage 1: Selecting and pre-processing constraints. In this stage, we can treat the set of pairwise constraints as a new graph, where an edge exists between two nodes from the original network if they share a pairwise constraint (either must-link or cannot-link). Then we look for all possible forbidden triads among the nodes involved in the must-link set. Given three nodes A, B, C, a forbidden triad (or open triad ) occurs when A is connected to B and C, but no edge exists between B and C. In our pre-processing step, we look for such cases — i.e. where we do not know whether a must-link or cannot-link exists between a pair of nodes B and C. To control the size of the constraints set, we greedily expand it until we reach a pre-defined maximum size. This stage can be summarized as follows (see also Fig. 3): 1. Choose a small initial random set of both must-link and cannot-link sets. 2. Find all possible forbidden triad cases in the must-link set to query the oracle about their relationship. 3. If their relationship is must-link insert it into must-link set, otherwise insert it into cannot-link set. 4. Repeat all steps until the set reaches the maximum size. At the end of this stage, the pairwise constraints set is ready to be supplied to the GCE algorithm for community detection. Stage 2: Pairwise Constrained GCE (PC-GCE). During the community detection phase, we incorporate only cannot-link constraints into the existing GCE algorithm as follows (see Fig. 4 for an illustration): 1. Find seeds, which are all maximal cliques in G with at least k nodes. 2. Choose the largest unexpanded seed and greedily expand it to a candidate community C0 by using a community fitness function (Eqn. 1) until adding any node no longer increases the fitness value. However, during this expan- sion process, do not add any node, which has a cannot link relationship with any existing node in the seed. 3. Check for the existence of any cannot-link constraints among all pairs of nodes in C0. If such a pair exists, calculate the fitness for both nodes relative to C0, and remove the one with the lower value. 4. Test the distance between the community C0 and all of the already accepted communities, using Eqn. 2. If the distance between C0 and any accepted community C is