=Paper= {{Paper |id=Vol-2086/AICS2017_paper_28 |storemode=property |title=Semi-Supervised Overlapping Community Finding with Pairwise Constraints |pdfUrl=https://ceur-ws.org/Vol-2086/AICS2017_paper_28.pdf |volume=Vol-2086 |authors=Elham Alghamdi,Derek Greene |dblpUrl=https://dblp.org/rec/conf/aics/AlghamdiG17 }} ==Semi-Supervised Overlapping Community Finding with Pairwise Constraints== https://ceur-ws.org/Vol-2086/AICS2017_paper_28.pdf

Semi-Supervised Overlapping Community
Finding with Pairwise Constraints

Elham Alghamdi and Derek Greene

School of Computer Science & Informatics, University College Dublin
elham.alghamdi@ucdconnect.ie,derek.greene@ucd.ie

Abstract. In complex networks, we say that a network has community
structure if subsets of its nodes form dense, highly-connected groups.
Algorithms for detecting communities are generally unsupervised, rely-
ing solely on the network topology. However, such algorithms can often
fail to uncover structure that reflects the underlying communities in the
data, particularly when those communities are highly overlapping. One
of the ways to improve accuracy is by harnessing additional background
information (e.g. from domain experts), which can be used as a source
of constraints to guide the community detection process. In this work,
we explore the potential of semi-supervised strategies to improve algo-
rithms for finding overlapping communities in networks. Specifically, we
propose a greedy approach for finding communities using a limited num-
ber of pairwise constraints.

1 Introduction
Many applications of machine learning do not neatly correspond to the standard
distinction between supervised and unsupervised learning [6]. In many domains,
a limited degree of background knowledge will be available. Often supervision
will take the form of pairwise constraints, which describe the relations between
pairs of data objects. Such constraints have been used to guide and improve the
usefulness of clustering algorithms [4]. The idea of semi-supervised learning also
extends to the area of network analysis. Tasks such as community detection can
potentially benefit from the introduction of limited supervision originating from
individual domain experts or crowdsourcing, where this knowledge might be
encoded as constraints indicating that a pair of nodes should always be assigned
to the same community or should never be assigned to the same community.
By harnessing this knowledge, we can potentially uncover communities of nodes
which are difficult to identify when analyzing large or noisy networks.
Initial work in community detection focused on the development of algo-
rithms which produce disjoint groups, where each node belongs one community
[5]. However, in many real-world networks, nodes will naturally belong to mul-
tiple communities. In the case of both online and offline social networks, we can
observe pervasive overlap where individuals belong to many highly-overlapping
social groups [2]. More recently, overlapping community finding algorithms have
been developed for application to these real networks [2, 16]. In contrast, work
on semi-supervised community finding continues to focus on cases where com-
munities are required to be disjoint.
In this paper, we propose a semi-supervised approach for community find-
ing, based on the concept of greedy clique expansion [16], which we refer to as
Pairwise Constrained GCE (PC-GCE). We introduce must-link and cannot-link
constraints to both the initialization phase of the process and to the subse-
quent community expansion process. Experimental evaluations on benchmark
synthetic networks demonstrate that the introduction of a relatively small num-
ber of constraints can improve our ability to correctly uncover the underlying
communities in these networks.
The remainder of this paper is structured as follows: Section 2 provides a
summary of relevant work pertaining to semi-supervised learning and commu-
nity finding. In Section 3 we describe the proposed approach for community
finding. To demonstrate the effectiveness of the approach, in Section 4 we per-
form a benchmarking evaluation on several synthetic networks. Finally, Section
5 presents concluding remarks and suggestions for extending this work.

2 Related Work

2.1 Community Finding

Finding non-overlapping communities. Algorithms in this context can be
broadly grouped into three types. (1) Hierarchical algorithms construct a tree of
communities based on the network topology. These algorithms can be one of two
types: divisive algorithms [10] or agglomerative algorithms [7]. (2) Modularity-
based algorithms optimize the well-known modularity objective function to un-
cover communities in the network [21]. (3) Other algorithms. This category in-
cludes algorithms based on label propagation approach, spectral algorithms that
make use of the eigenvectors of Laplacian matrix or standard matrix, and meth-
ods based on statistical models [9].
Finding overlapping communities. Existing algorithms in this context can
be classified into four main categories. (1) Node seeds and local expansion. These
algorithms detect communities by starting from a node or a group of nodes, then
expanding them into a community using a quality function. OSLOM [13] is an
example of such an algorithm, which uses a statistical function to evaluate the
node value to expand it to a community. Another example is MOSES [20], which
is an algorithm based on a statistical model and uses an objective function as a
greedy optimization technique. (2) Clique expansion. This type of method uses
a group of fully-connected nodes, called a clique, as the starting point for expan-
sion. CFinder [1] and Greedy Clique Expansion (GCE) [16] are examples of this
type of algorithm. (3) Link clustering. This category of algorithms detects com-
munities by splitting the links rather than the nodes [3]. (4) Label propagation.
This strategy classifies each node into a community based on its neighboring
nodes affinities. An example is the COPRA algorithm [11].
2.2 Semi-Supervised Learning

Several forms of prior knowledge have been used to guide the community detec-
tion process. The most widely-used has been pairwise constraints (”must-link”
or ”cannot-link”), which indicate that two nodes must be in the same community
or must be in different communities. Such constraints have been implemented in
several algorithms, including a modularity-based method [18], a spectral analysis
method [12, 23], and methods based on matrix factorization [22, 23]. Instead of
constraints, other algorithms use node labels as prior knowledge to improve the
process of community detection [17]. In [19], the authors propose a method that
uses a semi-supervised label propagation algorithm based on node labels and
negative information, where a node does not belong to a specific community.
The clear majority of semi-supervised algorithms in this area aim at detecting
disjoint communities, whereas many real-world networks contain overlapping
communities [1]. To the best of our knowledge, very little work has been done
in the context of finding overlapping communities context. In [8], a small set of
nodes called seed nodes was used, whose affinities to a community is provided
as prior knowledge to infer the rest of the nodes affinities in the network. In the
remainder of this paper we focus on the problem of semi-supervised community
finding suitable for application to networks containing overlapping communities.

3 Methods

3.1 Greedy Clique Expansion (GCE)

The GCE [16] community finding algorithm initially finds maximal cliques as
seeds, and subsequently expands these seeds into larger communities in a greedy
fashion, by optimizing a local fitness function. Given a network G, a user-
specified minimum clique size k, and a minimum community distance e, the
GCE algorithm involves the following steps:

1. Find the seeds, which are all maximal cliques in G with at least k nodes.
2. Choose the largest unexpanded seed and greedily expand it into a candidate
community C0 by using a community fitness function FS until adding any
node would not increase the fitness value.
3. Test the distance between C0 and all previously accepted communities. If
the distance between C0 and an existing community C is < e, then C and
C0 are deemed to be near-duplicates, so discard C0. Otherwise, accept C0.
4. Repeat steps 2 and 3 until no more seeds remain

GCE employs a fitness function defined [14] to expand each seed, which defines
the fitness community of a given community S in terms of its internal and
external degrees (kin and kout ) as follows
S
kin
FS = S + k S )α
(1)
(kin out
where the parameter α typically takes values in the range [0.9, 1.5]. Generally,
we can summarize the greedy expansion step using the fitness function FS as
follows:
1. For each node v in the frontier of a given seed S, calculate v’s fitness value,
i.e., the amount by which the community fitness of S would change if the
node v was added to S.
2. Choose the node that has the maximum fitness value, vmax .
3. If the fitness value vmax is positive, then insert the node v into S and go
back to Step 1. Otherwise, terminate and return S.
To discard near-duplicate communities in Step 4 of the GCE algorithm, the
authors proposed the use of an overlap-based measure of distance between two
communities:
|S ∩ S 0 |
δE (S, S 0 ) = 1 − (2)
min(|S|, |S 0 |)
Given two communities S and S0, Eqn. 2 measures the number of nodes in the
smaller community that are not included in the larger one. So, for a given set of
communities, a near-duplicate community of a given community S would be all
the communities that are within a distance e of S.

3.2 Pairwise Constraints in Community Detection
Given a network that contains a set of nodes V , semi-supervised pairwise con-
straints typically take two possible forms:
1. Must-link constraints specify that two nodes must be in the same community.
Let CM L be the must-link constraint set: ∀ vi , vj ∈ V where i 6= j, (vi , vj )
∈ CM L indicates that the two nodes vi and vj must be assigned to the same
community.
2. Cannot-link constraints specify that two nodes must be in different commu-
nities. Let CCL be the cannot-link constraint set: ∀ vi , vj ∈ V where i 6= j,
(vi , vj ) ∈ CCL indicates that the two nodes vi and vj must be assigned to
two distinct communities.
The simplest approach to selecting this form of constraints is to naively select
a pair of nodes (vi , vj ) at random, and query the oracle about whether the pair
should share a must-link or cannot-link constraint. This process can be repeated
to select the required number of constraints or until some supervision budget is
exhausted.
In non-overlapping community finding, must-link constraints have what is
referred to as a transitive property, where a third must-link relationship can be
inferred from two other associated must-link constraint pairs. So, if (vi , vj ) ∈
CM L , and (vi , vk ) ∈ CM L , then we can also infer that (vj , vk ) ∈ CM L (see the
first example in Fig. 1).
However, incorporating constraints into the context of overlapping commu-
nities is more challenging. This is because the transitive property does not hold
%& %' %& %' %& %'
!" !"
!"
!$ !# !$ !# !$ !#

Non-overlapping Case 1 Overlapping Case
Fig. 1: In the non-overlapping context, the transitive property allows us to infer a third
must-link constraint from two existing must-link constraints. However, this does not
automatically apply in the overlapping context, where two possible situations exist.

Must-link Cannot-link Must-link Cannot-link
Set Set Set Set
!" !# !) !* !" !# !) !*
!" !$ !+ !, !" !$ !+ !,
!" !% !& !% !" !% !- !,
!" !& !# !& !" !& !. !/
!' !( !$ !% !' !( !0 !1
!& !%
!# !&
!$ !%
!$ !#

Fig. 2: Example of expanding 5 pairwise constraints. Inserting the derived cannot-link
pairs from all connected must-link pairs will double the size of the cannot-link set.

here (see the second example in Fig. 1). Specifically, if (vi , vj ) ∈ CM L , and (vi , vk )
∈ CM L , there are two possible scenarios for the pair (vj , vk ). It can be the case
that either (vj , vk ) ∈ CM L or (vj , vk ) ∈ CCL . This is because an overlapping
node vj can have a must-link constraint with both vi and vi , yet these two nodes
could belong to two different communities. However, it is also possible that all
three nodes are in fact in the same community. Unless we explicitly inform the
algorithm about whether a must-link or cannot-link constraint exists for the pair
(vj , vk ), we cannot reliably distinguish between the two cases. If the network has
highly-overlapping communities, then this problematic situation will occur more
frequently. If we naively attempt to incorporate pairwise constraints into over-
lapping community finding without taking this situation into account, it is likely
that the quality of the resulting communities can potentially decrease even as
more constraints are added.
To resolve this issue, after selecting pairwise constraints, we need to explicitly
detect every cannot-link pair that derived from any two connect must-link pairs
and insert it to the cannot link set. However, this will significantly increase the
set of cannot-link constraints. For example, if we want to feed only five pairwise
constraints, each as shown in Fig. 2, inserting the required cannot-link pairs
will double the size of the cannot-link set. In the next section, we introduce an
approach to address this issue.

3.3 Semi-Supervised GCE
We now describe our proposed approach for overlapping community finding
with limited supervision, referred to as Pairwise Constrained GCE (PC-GCE).
This approach consists of two stages: The first stage includes selecting and pre-
processing constraints and resolves the problem of the lack of the transitive
property for must-link constraints in the context of overlapping communities.
The second stage supplies the resulting constraints to the GCE algorithm to
process them during community detection. In the following, these stages are de-
scribed in more detail.

Stage 1: Selecting and pre-processing constraints. In this stage, we can
treat the set of pairwise constraints as a new graph, where an edge exists between
two nodes from the original network if they share a pairwise constraint (either
must-link or cannot-link). Then we look for all possible forbidden triads among
the nodes involved in the must-link set. Given three nodes A, B, C, a forbidden
triad (or open triad ) occurs when A is connected to B and C, but no edge exists
between B and C. In our pre-processing step, we look for such cases — i.e. where
we do not know whether a must-link or cannot-link exists between a pair of nodes
B and C. To control the size of the constraints set, we greedily expand it until
we reach a pre-defined maximum size. This stage can be summarized as follows
(see also Fig. 3):
1. Choose a small initial random set of both must-link and cannot-link sets.
2. Find all possible forbidden triad cases in the must-link set to query the oracle
about their relationship.
3. If their relationship is must-link insert it into must-link set, otherwise insert
it into cannot-link set.
4. Repeat all steps until the set reaches the maximum size.
At the end of this stage, the pairwise constraints set is ready to be supplied to
the GCE algorithm for community detection.
Stage 2: Pairwise Constrained GCE (PC-GCE). During the community
detection phase, we incorporate only cannot-link constraints into the existing
GCE algorithm as follows (see Fig. 4 for an illustration):
1. Find seeds, which are all maximal cliques in G with at least k nodes.
2. Choose the largest unexpanded seed and greedily expand it to a candidate
community C0 by using a community fitness function (Eqn. 1) until adding
any node no longer increases the fitness value. However, during this expan-
sion process, do not add any node, which has a cannot link relationship with
any existing node in the seed.
3. Check for the existence of any cannot-link constraints among all pairs of
nodes in C0. If such a pair exists, calculate the fitness for both nodes relative
to C0, and remove the one with the lower value.
4. Test the distance between the community C0 and all of the already accepted
communities, using Eqn. 2. If the distance between C0 and any accepted
community C is