

1613-0073


Constraint Community Detection: modelling approaches with applications

Oksana Pichugina

Lyudmyla Kirichenko

Yurii Skob

Olha Matsiy

Kharkiv National University of Radio Electronics, 14 Nauki Avenue, Kharkiv, 61166, Ukraine
National Aerospace University "Kharkiv Aviation Institute", 17 Chkalova Street, Kharkiv, 61070, Ukraine
V. N. Karazin Kharkiv National University, 4 Svobody Sq., Kharkiv, 61022, Ukraine

2023


Community Detection (CD) is a fundamental issue in Network Analysis, focusing on identifying densely connected node groups within a network. Its broader interpretation, Constrained Community Detection (CCD), emerges when supplementary constraints are applied, expanding the scope of the problem. CD has been extensively explored in Network Analysis, and numerous exact and approximate methods have been developed for it. CCD, in turn, encompasses a more extensive array of real-world issues and applications within Network Analysis. There is a significant need to broaden the spectrum of CD problem variants by establishing rigorous mathematical models. These models would serve as the foundation for developing new exact and heuristic algorithms to solve these problems. This paper investigates various approaches to CCD problem (CCDP) modeling. Specifically, we introduce a novel method for problem modeling that encompasses a broader range of constraints and establish its connection with the conventional CCDP modeling approach. Additionally, we demonstrate its distinct advantages. The integration of these approaches presents opportunities for extending the class of CCDPs formalized as polynomial optimization problems. Consequently, these problems can be efficiently addressed using contemporary nonlinear solvers and can also be transformed into solvable QUBO models applicable to both quantum and digital annealing.

Keywords: Community Detection, Modularity, Constraint, Binary Optimization, Integer Programming, Polynomial Optimization, Network, Node partition

Introduction

Community Detection and Edge Clustering are two main techniques of network analysis (NA) aiming to uncover structures and patterns within networks [3, 4, 5, 6, 7, 8, 9, 10, 11, 12].

Community detection (CD, also called graph partitioning or network clustering) is the process of identifying groups of nodes within a network, called clusters, whose nodes are more densely connected to each other than to nodes in other clusters.

CD aims to find natural subdivisions of the network into dense clusters called communities. Community detection is a fundamental aspect of network analysis with wide-ranging implications across various domains. Identifying communities within networks provides valuable insights into the structure, behavior, and function of complex systems. The following reasons demonstrate the importance of community detection: it provides an understanding of complex systems; supports Social Network Analysis, in particular the exploration of collaboration networks and the detection of criminal networks; supports Cultural and Societal Analysis; enables solving many research problems in Biology and Bioinformatics, e.g., in Epidemiology and Disease Spread; helps to improve the management of Recommendation Systems, Urban Planning and Transportation Systems; supports proper Market Segmentation and the study of Customer Behavior; strengthens financial and economic systems through Fraud Detection and enhanced Cybersecurity; raises the performance of real networks by improving their Robustness and Resilience; and helps to solve numerous problems of Natural Language Processing by exploring the Semantic Web and Content Organization.

Community detection has a wide range of applications across various fields, where the goal is to uncover hidden structures, patterns, and relationships within networks. To illustrate the listed research domains, we outline some applications [1, 2, 8, 11, 13, 14, 15, 16]:
1. Identifying groups of friends or communities in social networks, analyzing information flow and influence propagation, detecting online communities in forums and social media platforms.
2. Discovering research communities and collaborations in academic citation networks, identifying influential researchers or papers within specific fields.
3. Investigating criminal networks and identifying key actors and their associations, analyzing patterns of criminal activity and connections.
4. Identifying protein complexes and functional modules in protein-protein interaction networks, analyzing genetic regulatory networks and identifying co-regulated gene groups.
5. Tracking the spread of diseases through contact networks and identifying potential hotspots, analyzing transmission patterns and identifying groups at higher risk.
6. Enhancing recommendation algorithms by considering communities of users with similar interests, grouping web pages with similar content for better search results and content organization.
7. Optimizing routing and resource allocation in communication networks, identifying clusters of devices in network traffic analysis.
8. Analyzing transportation networks to identify hubs, sub-communities, and traffic patterns, designing efficient public transportation routes based on community structure.
9. Segmenting customers based on purchasing behavior and preferences, analyzing social interactions to understand consumer trends.
10. Identifying groups of users engaging in coordinated fraudulent activities, detecting anomalies and security threats by analyzing network behavior.
11. Organizing and categorizing web content based on thematic communities, enhancing search results by considering community relevance.

Real-world CD problems are often accompanied by additional constraints on nodes, edges and communities, which complicates their modeling and significantly affects solution methods. This paper studies the issue of modeling CD problems with additional constraints (Constraint Community Detection). In particular, we propose a new approach to modeling these problems as Boolean constrained optimization problems.

1. Prerequisites

The application domain of CD is far from limited to the above list. Conducting CD is valuable whenever understanding network structure and relationships is important for making informed decisions or gaining insights into complex systems.

CD is conducted on networks, so it is necessary to distinguish networks from graphs. The terms "network" and "graph" are closely related concepts in NA and Graph Theory, but they are used in slightly different ways and contexts.

A graph is a mathematical object that consists of a set of nodes (vertices) and a set of edges that connect pairs of nodes. It is denoted as $G = (V, E)$, where $V = \{1, ..., n\}$ is a node set (vertex set) and $E$ is an edge set. Graphs represent relationships or connections between different entities of various types. Graphs can be directed or undirected, weighted or unweighted.

A network is a collection of interconnected elements. Networks can represent a wide range of real-world systems with relationships or interactions between entities. A network can be represented by nodes representing entities and edges representing interactions of the entities, i.e. any network is representable by a certain graph.

The community detection problem (CDP) in a network is formulated as an optimization problem, where the goal is to find a partition of nodes into communities that maximizes a certain objective function. Different objective functions capture different aspects of community structure. Modularity [11] and conductance [12] are the two most common.

1.1. Modularity optimization CDP

Modularity quantifies the difference between the observed number of edges within communities and the expected number of such edges in a random graph. In this paper, modularity is chosen as the optimization criterion.

The popular formalization (CDP statement) is
• Input: a network $G$ given by its weighted adjacency matrix $A = [a_{ij}]_{i,j \in V}$; the number $K$ of desired communities (or an upper bound on the number of communities).
• Output: a partition of the nodes into $K$ communities.
• Objective function: maximization of modularity $Q$.

The modularity function $Q(c|G)$ assesses the extent to which a partition of network nodes corresponds to the densely-connected node subsets in the network and is defined as

$Q(c|G) = \frac{1}{2m} \sum_{i,j \in V} \left( a_{ij} - \frac{d_i d_j}{2m} \right) \delta\{c_i = c_j\}$, (1)

where
• $d_i = \sum_{j \in V} a_{ij}$ is the weighted degree of the node $i$, $2m = \sum_{i,j \in V} a_{ij}$ is the total weight of the network (in particular, $m$ is the number of edges in $G$ if the graph is unweighted);
• $c_i \in [1, K]$ is the community assignment for the node $i \in V$;
• $c = (c_i)_{i \in V}$ is the network-wide node community assignment;
• $\delta\{c_i = c_j\}$ is an indicator function equal to one if $c_i = c_j$ and to zero otherwise.

The problem (further referred to as Problem 1) is to find the community assignment $c^*$ maximizing the modularity function (1):

(Problem 1): $Q(c^*|G) = \max_{c \in \mathcal{C}} Q(c|G)$, (2)

where $\mathcal{C}$ is a set of community node assignments over $G$, i.e. of vectors of integer variables $c = [c_i]_{i \in V} \in \mathbb{Z}^n_{>0}$ satisfying the following constraints.

1. Labels of communities are in the range $[1, K]$:

$1 \le c_i \le K$, $i \in V$. (3)

2. Partition constraints. Each node must belong to exactly one community. In other words, $c$ induces a partition of $V$. To formalize this condition, let $C_1, ..., C_K$ be a set of communities induced by $c$. They form a partition of the node set $V$ if

$|C_1| + ... + |C_K| = n$, $C_k \cap C_{k'} = \emptyset$, $k < k'$, $k, k' \in [1, K]$. (4)

In terms of the introduced integer variables, the communities can be represented as

$C_k = \{i \in V : c_i = k\}$, $k \in [1, K]$.

3. Symmetry constraints:

$\forall \{i, j\} \subseteq V$: $c_i = c_j \Leftrightarrow c_j = c_i$. (5)

4. Transitivity constraints:
• For every triple $\{i, j, l\} \subseteq V$, if $i, j$ are in the same community and $j, l$ are in the same community, then $i, l$ are in the same community, i.e.

$c_i = c_j$ and $c_j = c_l \Rightarrow c_i = c_l$. (6)

• For every triple $\{i, j, l\} \subseteq V$, if $i, j$ are in the same community and $j, l$ are in different communities, then $i, l$ are also in different communities, i.e.

$c_i = c_j$ and $c_j \ne c_l \Rightarrow c_i \ne c_l$. (7)

Also, if $K$ is an exact number of communities, the following constraints are used for mathematical formalization: $\forall k \in [1, K]$ $\exists i \in V$: $c_i = k$, or

$|C_k| \ge 1$, $k \in [1, K]$. (8)
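As a minimal sketch of how the modularity (1) of a given assignment can be evaluated numerically, the following Python fragment computes $Q(c|G)$ from a dense adjacency matrix. The function name `modularity`, the use of NumPy, and the six-node toy network (two triangles joined by one edge) are illustrative assumptions and are not part of the original formulation.

```python
import numpy as np


def modularity(A, c):
    """Evaluate Q(c|G) = 1/(2m) * sum_ij (a_ij - d_i*d_j/(2m)) * [c_i == c_j]."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    d = A.sum(axis=1)                  # weighted degrees d_i
    two_m = A.sum()                    # 2m: total weight over ordered pairs (i, j)
    B = A - np.outer(d, d) / two_m     # entries a_ij - d_i*d_j/(2m)
    same = c[:, None] == c[None, :]    # indicator of c_i == c_j
    return float((B * same).sum() / two_m)


def two_triangles():
    """Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by edge (2,3)."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    A = np.zeros((6, 6))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A


A = two_triangles()
q_split = modularity(A, [1, 1, 1, 2, 2, 2])    # one community per triangle
q_single = modularity(A, [1, 1, 1, 1, 1, 1])   # all nodes in one community
```

For this toy network, the split into the two triangles gives $Q = 5/14$, whereas placing all nodes in a single community gives $Q = 0$.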

1.2. Approaches to CD

Since CDP is NP-hard, various heuristic and approximation community detection algorithms (CDAs) are used to find approximate solutions effectively [3, 4, 5, 6, 7, 8, 10, 11, 12, 17, 18, 19, 20, 21]. The most common algorithms are:
• Louvain Method [17], which is a greedy optimization algorithm iteratively improving modularity by moving nodes between communities;
• Label Propagation [12], where nodes iteratively adopt the labels of their neighbors until stable labeling is achieved;
• Spectral Clustering [11], which involves computing the eigenvectors of specific matrices derived from the graph to find clusters;
• The Girvan and Newman algorithm [22], a hierarchical community detection method that divides communities by eliminating edges with higher betweenness;
• The Clauset community detection algorithm [13], which identifies communities by optimizing the modularity of network partitions;
• The Brandes et al. Community Detection Algorithm, a greedy agglomerative method that utilizes Linear Integer Programming to optimize modularity for community detection;
• The Spin Glass Algorithm [23], a hierarchical agglomerative approach that minimizes the Hamiltonian of the Potts-like spin model, with spin states symbolizing communities;
• The Walk Trap Algorithm developed by Pons and Latapy [24], a hierarchical agglomerative method rooted in random walks, initiating from individual clusters.

Most of these methods are applied directly to Problem 1, where the constraints (3)-(7) are present implicitly. In addition, CDP can also be easily reformulated as a binary optimization problem (see below) and, respectively, be solved by Integer Programming methods [25] and other Nonlinear Programming techniques [26].
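For very small networks, Problem 1 can be solved exactly by enumerating all $K^n$ assignments, which gives a reference value against which heuristics can be judged. The sketch below is such a brute-force baseline and is not one of the algorithms listed above; the helper names and the toy network are assumptions made for illustration, and the exponential running time restricts it to a handful of nodes.

```python
import itertools
import numpy as np


def modularity(A, c):
    """Q(c|G) for a dense adjacency matrix A and labels c (see equation (1))."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(c)
    d = A.sum(axis=1)
    two_m = A.sum()
    B = A - np.outer(d, d) / two_m
    return float((B * (c[:, None] == c[None, :])).sum() / two_m)


def brute_force_cd(A, K):
    """Enumerate all assignments c in {1..K}^n and keep the best one found."""
    n = len(A)
    best_c, best_q = None, -np.inf
    for c in itertools.product(range(1, K + 1), repeat=n):
        q = modularity(A, c)
        if q > best_q + 1e-12:          # strict improvement, tolerant to rounding
            best_c, best_q = c, q
    return list(best_c), best_q


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

c_best, q_best = brute_force_cd(A, K=2)   # K acts as an upper bound here
```

Because labels may remain unused, $K$ plays the role of an upper bound on the number of communities in this sketch; enforcing an exact number would additionally require the condition (8).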

1.3. Modularity optimization CCDP

Constrained Community Detection (Constrained CD, CCD) [9, 14] is a generalization of the standard community detection problem where additional constraints, such as ( 8 ), are introduced to guide the process of identifying communities within a network. These constraints can be in the form of prior knowledge, user preferences, or specific requirements related to the network’s properties. The goal is to incorporate these constraints into the CD process while optimizing an objective function that captures the desired community structure, such as modularity and conductance.

Among known constraints in CCD are:
• Community Size Constraints: constraints on the sizes of communities are imposed to prevent the formation of very small or very large communities. Let $l_k$, $u_k$ be a lower and an upper bound on the size of the community $C_k$, $k \in [1, K]$. In this notation, the constraint is

$l_k \le |C_k| \le u_k$, $k \in [1, K]$. (9)

• Balance Constraints: a balance requirement for certain community characteristics such as the size of communities, the distribution of node degrees and node/edge attributes. Suppose we are given an upper bound $\Delta$ on the difference of sizes of two communities in the community assignment $c$. Mathematically, it can be expressed as

$(|C_k| - |C_{k'}|)^2 \le \Delta^2$, $k < k'$, $k, k' \in [1, K]$. (10)

• Seed Nodes or Labels: the goal is to ensure that certain nodes are assigned to the specified communities. Let $S \subseteq V$ and $K_i \subset [1, K]$ be a set of communities, where the node $i$ can be assigned, for $i \in S$. This can be written as

$c_i \in K_i \subseteq [1, K]$, $i \in S$. (11)

• Conflicting constraints: these constraints express the condition that some nodes must be assigned to different communities. For their formalization, we introduce a set $\mathcal{S}^{\ne}$, whose elements are sets of conflicting nodes. They are assigned to different communities if:

$\mathcal{S}^{\ne} = \{S \subseteq V : \forall \{i, j\} \subseteq S$ $c_i \ne c_j\}$. (12)

• Forcing constraints: these constraints require certain nodes to be assigned to the same community. Similar to the conflict ones, we formalize them as follows. First, we introduce a set $\mathcal{S}^{=}$ consisting of the sets of forcing nodes, i.e. the ones that need to be assigned to the same community:

$\mathcal{S}^{=} = \{S \subseteq V : \forall \{i, j\} \subseteq S$ $c_i = c_j\}$. (13)

• Hierarchical Constraints: require hierarchical structuring of the detected communities, where the communities are nested within larger communities;
• Similarity Constraints: the goal is to ensure a certain level of similarity between nodes and edges, including their specific attributes.
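The size, balance, seed, conflicting and forcing constraints listed above can be checked directly on an integer assignment vector. The following Python fragment is a minimal sketch of such a feasibility check; the function names, the dictionary-based encoding of the seed sets, and the example data are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def community_sizes(c, K):
    """Return [|C_1|, ..., |C_K|] for labels c_i in 1..K."""
    counts = Counter(c)
    return [counts.get(k, 0) for k in range(1, K + 1)]


def check_ccd_constraints(c, K, lower, upper, delta, seeds, conflicts, forcing):
    """Report which constraint groups hold for the assignment c.

    lower/upper: per-community size bounds; delta: bound on size differences;
    seeds: {node: set of allowed labels}; conflicts/forcing: lists of node sets.
    """
    sizes = community_sizes(c, K)
    return {
        "size": all(lo <= s <= up for lo, s, up in zip(lower, sizes, upper)),
        "balance": all((a - b) ** 2 <= delta ** 2
                       for a, b in combinations(sizes, 2)),
        "seed": all(c[i] in allowed for i, allowed in seeds.items()),
        "conflict": all(c[i] != c[j]
                        for S in conflicts for i, j in combinations(sorted(S), 2)),
        "forcing": all(c[i] == c[j]
                       for S in forcing for i, j in combinations(sorted(S), 2)),
    }


# Example: nodes 0..5, two communities of three nodes each.
c = [1, 1, 1, 2, 2, 2]
report = check_ccd_constraints(
    c, K=2, lower=[2, 2], upper=[4, 4], delta=1,
    seeds={0: {1}, 5: {1, 2}}, conflicts=[{0, 3}], forcing=[{3, 4, 5}],
)
violated = check_ccd_constraints(
    c, K=2, lower=[2, 2], upper=[4, 4], delta=1,
    seeds={0: {2}}, conflicts=[{0, 1}], forcing=[{2, 3}],
)
```

In the first call all five groups of constraints hold; in the second call the seed, conflicting and forcing requirements are violated while the size and balance requirements still hold.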

We will refer to Problem 1 complemented by the constraints (9)-(13) as Problem 1.G. It would be very desirable to reformulate Problem 1 and Problem 1.G as integer optimization problems with variables forming the vector $c$, and for this, the approaches described in [27, 28] can be applied. However, this has not yet been achieved. That is why other modeling approaches are needed.

Additional constraints in CDP require implementing more complex optimization methods than the standard CD approaches because it is complicated to incorporate additional constraints into the standard community detection heuristics. These methods utilize mathematical models of CCD problems (CCDP) that are not uniquely determined. Accordingly, the effectiveness of the CCD methods highly depends on these underlying mathematical models.

In this paper, we study approaches to CCDP modelling. In particular, we present a new approach to problem modeling that covers a larger number of constraints and establish its connection with the standard approach to CCDP modeling. Combining these two approaches opens up prospects for expanding the class of formalized CCD problems in the form of polynomial optimization problems. Respectively, these CCDPs can be solved effectively using contemporary nonlinear solvers [29] and also reduced to QUBO models [30] solvable by quantum and digital annealers.

2. Modelling CCD problems as binary programs

We will formalize additional constraints using binary variables. In order to accomplish this, we first reformulate the integer programming problem (2), whose dimension is $n$, as a binary optimization problem of higher dimension.

2.1. Standard CCD modelling approach

First, we recall the standard approach [9, 14, 31, 32] that allows formalizing some of the abovementioned constraints.

Suppose variables form a square matrix of binary variables of the size $n \times n$:

$X = [x_{ij}]_{i,j \in V} \in \mathbb{B}^{n \times n}$, where $\mathbb{B} = \{0, 1\}$, $x_{ij} = 1$ if the nodes $i$ and $j$ are in the same community, and $x_{ij} = 0$ otherwise. (14)

Then the expressions (1) and (2) for the objective can be rewritten as

(Problem 2): $Q(X^*) = \max_{X \in \mathbb{B}^{n \times n}} Q(X)$, (15)

$Q(X) = \frac{1}{2m} \sum_{i,j \in V} \left( a_{ij} - \frac{d_i d_j}{2m} \right) x_{ij}$. (16)

When the matrix $X^*$ is found, then the community assignment $c^*$ is formed as follows: an arbitrary node $i \in V$ is selected and is assigned to the community 1 along with all other nodes belonging to the same community as $i$. The process of community assignment continues iteratively for unassigned nodes and the communities $2, ..., K$.

Certain constraints must be added to (15).

1. CDP constraints.
a) The constraint (3) cannot be written in terms of the variables (14). However, there are CD algorithms where the number of communities can be specified in advance [12].
b) The constraint (4) holds due to the above way of the node-to-community assignment.
c) The symmetry constraint (5) is formalized as

$\forall \{i, j\} \subseteq V$: $x_{ij} = x_{ji}$. (17)

d) The constraint (6) takes the form of

$\forall \{i, j, l\} \subseteq V$: $x_{ij} + x_{jl} - x_{il} \le 1$. (18)

e) The constraint (7) becomes

$\forall \{i, j, l\} \subseteq V$: $x_{ij} - x_{jl} + x_{il} \le 1$. (19)

f) Likewise (3), the constraint (8) cannot be written in terms of $X$-entries.

2. CCDP constraints are only partially representable in terms of the $X$-elements.
a) (9), (10) and (11) are not formalized in terms of $X$.
b) The condition (12) can be written as follows:

$\forall S \in \mathcal{S}^{\ne}$: $\forall \{i, j\} \subseteq S$ $x_{ij} = 0$. (20)

c) Similar to (20), the constraints (13) are rewritten as:

$\forall S \in \mathcal{S}^{=}$: $\forall \{i, j\} \subseteq S$ $x_{ij} = 1$. (21)

Thus, among the above CD and CCD constraints, five groups are not formalized in terms of $X$ and require other approaches for formalization.

The dimension of this CCDP given by (15)-(21) (further referred to as Problem 2.G) is $n^2$. It can be reduced to $\frac{n^2 - n}{2}$ if the symmetry constraint is utilized for eliminating the variables $x_{ij}$ satisfying the relation $i \ge j$.

The advantage of the objective functions and the formalized constraints in Problems 2 and 2.G is their linearity, while the disadvantage is that no approach is currently known for formalizing the rest of the constraints in terms of entries of $X$.
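The relation between an assignment vector and the node-to-node matrix, together with the recovery procedure described above, can be sketched in Python as follows. The function names are hypothetical, and the transitivity check assumes the usual linear form $x_{ij} + x_{jl} - x_{il} \le 1$ over all ordered triples of distinct nodes, which is one common way of writing this condition.

```python
import numpy as np
from itertools import permutations


def co_membership(c):
    """Build X with x_ij = 1 iff nodes i and j share a community."""
    c = np.asarray(c)
    return (c[:, None] == c[None, :]).astype(int)


def is_feasible(X):
    """Check symmetry and the (assumed) linear transitivity inequalities."""
    X = np.asarray(X)
    n = X.shape[0]
    if not np.array_equal(X, X.T):
        return False
    return all(X[i, j] + X[j, l] - X[i, l] <= 1
               for i, j, l in permutations(range(n), 3))


def recover_assignment(X):
    """Label communities 1, 2, ... in order of their first (lowest-index) node."""
    X = np.asarray(X)
    n = X.shape[0]
    c = [0] * n                          # 0 marks a node that is still unassigned
    label = 0
    for i in range(n):
        if c[i] == 0:
            label += 1
            c[i] = label
            for j in range(n):
                if X[i, j] == 1 and c[j] == 0:
                    c[j] = label         # j joins the community opened by i
    return c


X = co_membership([2, 2, 1, 1, 3])
c_recovered = recover_assignment(X)      # labels are renumbered, partition is kept

X_bad = X.copy()
X_bad[0, 2] = X_bad[2, 0] = 1            # 0~1 and 0~2 but not 1~2: not transitive
```

The recovered labels may differ from the original ones by a renumbering, since $X$ encodes only the partition and not the community labels.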

2.2. New CCD modelling approach

In this section, we present our approach to CCD modelling, also utilizing binary variables.

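Let us introduce another matrix of binary variables

$Y = [y_{ik}]_{i \in V, k \in [1, K]} \in \mathbb{B}^{n \times K}$, where $y_{ik} = 1$ if $c_i = k$, and $y_{ik} = 0$ otherwise.

In terms of these variables, the indicator function in (1) is representable as

$\delta\{c_i = c_j\} = 1 - \frac{1}{2} \sum_{k=1}^{K} (y_{ik} - y_{jk})^2$,

that allows rewriting the modularity function (2) in terms of the introduced binary variables:

$Q(Y) = \frac{1}{2m} \sum_{i,j \in V} b_{ij} - \frac{1}{4m} Q'(Y)$, $Q'(Y) = \sum_{i,j \in V} b_{ij} \sum_{k=1}^{K} (y_{ik} - y_{jk})^2$, (22)

where $B = [b_{ij}]_{i,j \in V}$: $b_{ij} = a_{ij} - \frac{d_i d_j}{2m}$, $i, j \in V$.

We came to the binary formulation of problem (2) (further referred to as Problem 3): find the binary matrix $Y^*$, where the minimum of $Q'(Y)$ is attained, i.e.,

(Problem 3): $Q'(Y^*) = \min_{Y \in \mathbb{B}^{n \times K}} Q'(Y)$, (23)

where $Q'(Y)$ is given by (22), and the one-hot constraints hold:

$\sum_{k=1}^{K} y_{ik} = 1$, $i \in V$. (24)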
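The node-to-community encoding can be illustrated numerically. The sketch below builds a one-hot matrix from an assignment vector and checks that, for one-hot rows, the indicator $\delta\{c_i = c_j\}$ equals $1 - \frac{1}{2}\sum_k (y_{ik} - y_{jk})^2$. The function names are hypothetical, and the objective `q_prime` assumes the form $Q'(Y) = \sum_{i,j} b_{ij} \sum_k (y_{ik} - y_{jk})^2$ with $b_{ij} = a_{ij} - d_i d_j/(2m)$, under which minimizing $Q'$ is equivalent to maximizing modularity.

```python
import numpy as np


def one_hot(c, K):
    """Y with y_ik = 1 iff c_i = k, for labels c_i in 1..K."""
    c = np.asarray(c)
    Y = np.zeros((len(c), K), dtype=int)
    Y[np.arange(len(c)), c - 1] = 1
    return Y


def modularity_matrix(A):
    """B with b_ij = a_ij - d_i*d_j/(2m)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    return A - np.outer(d, d) / A.sum()


def squared_row_distances(Y):
    """D with D_ij = sum_k (y_ik - y_jk)^2 (0 for equal rows, 2 otherwise)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(axis=2)


def q_prime(A, Y):
    """Assumed objective Q'(Y) = sum_ij b_ij * D_ij, to be minimized."""
    return float((modularity_matrix(A) * squared_row_distances(Y)).sum())


def modularity_from_Y(A, Y):
    """Q = 1/(2m) * sum_ij b_ij * (1 - D_ij / 2)."""
    two_m = np.asarray(A, dtype=float).sum()
    indicator = 1 - 0.5 * squared_row_distances(Y)
    return float((modularity_matrix(A) * indicator).sum() / two_m)


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Y = one_hot([1, 1, 1, 2, 2, 2], K=2)
Q = modularity_from_Y(A, Y)
Qp = q_prime(A, Y)
```

Since the entries $b_{ij}$ sum to zero, the two quantities are related by $Q = -Q'(Y)/(4m)$; for the toy network this gives $Q'(Y) = -10$ and $Q = 5/14$.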

Problem 3 is the constrained binary optimization problem with the quadratic objective (22) and linear equality constraints (24). Let us formalize in terms of $Y$-entries the rest of the above CDP and CCDP constraints.

1. CDP constraints.
a) The condition (3) holds automatically since the matrix $Y$ has $K$ columns.
b) Fulfillment of the condition (4) is ensured by the one-hot constraint (24).
c) In order to write out the symmetry constraint (5) in terms of $Y$, first, we establish the connection between elements of the matrices $X$ and $Y$:

$x_{ij} = \sum_{k \in [1, K]} y_{ik} y_{jk}$, $i, j \in V$. (25)

Making the substitution (25) into (17), we come to the quadratic equality constraint:

$\forall \{i, j\} \subseteq V$: $\sum_{k \in [1, K]} y_{ik} y_{jk} = \sum_{k \in [1, K]} y_{jk} y_{ik}$,

which is, clearly, redundant.
d) The constraints (6) and (7) also hold automatically.
e) The condition (8) is easily written in terms of the new variables, taking into account that $|C_k| = \sum_{i \in V} y_{ik}$, $k \in [1, K]$:

$\sum_{i \in V} y_{ik} \ge 1$, $k \in [1, K]$.

2. CCDP constraints.
a) In terms of $Y$, the constraint (9) looks like:

$l_k \le \sum_{i \in V} y_{ik} \le u_k$, $k \in [1, K]$. (26)

b) In terms of $Y$-entries, the constraint (10) is:

$\left( \sum_{i \in V} y_{ik} - \sum_{i \in V} y_{ik'} \right)^2 \le \Delta^2$, $k < k'$, $k, k' \in [1, K]$. (27)

c) In terms of $Y$, the condition (11) is represented as

$\sum_{k \in K_i} y_{ik} = 1$, $i \in S$. (28)

d) The binary representation of (12) is

$\forall S \in \mathcal{S}^{\ne}$: $\sum_{\{i, j\} \subseteq S} \sum_{k=1}^{K} (y_{ik} - y_{jk})^2 = \binom{|S|}{2} \cdot 2 = |S|(|S| - 1)$. (29)

e) The constraint (13) takes the form of

$\forall S \in \mathcal{S}^{=}$: $\sum_{\{i, j\} \subseteq S} \left( 2 - \sum_{k=1}^{K} (y_{ik} - y_{jk})^2 \right) = |S|(|S| - 1)$

that can be simplified to

$\forall S \in \mathcal{S}^{=}$: $\sum_{\{i, j\} \subseteq S} \sum_{k=1}^{K} (y_{ik} - y_{jk})^2 = 0$. (30)
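The constraints expressed through the node-to-community matrix can be verified with column sums and squared row differences. The fragment below is a minimal sketch of such a check; the function names and the example data are hypothetical, and the conflicting and forcing conditions are tested in the summed form, where the pairwise sum of $\sum_k (y_{ik} - y_{jk})^2$ must equal $|S|(|S| - 1)$ or $0$, respectively.

```python
import numpy as np
from itertools import combinations


def one_hot(c, K):
    """Y with y_ik = 1 iff c_i = k, for labels c_i in 1..K."""
    c = np.asarray(c)
    Y = np.zeros((len(c), K), dtype=int)
    Y[np.arange(len(c)), c - 1] = 1
    return Y


def pair_distance_sum(Y, S):
    """Sum over pairs {i, j} in S of sum_k (y_ik - y_jk)^2."""
    return sum(int(((Y[i] - Y[j]) ** 2).sum())
               for i, j in combinations(sorted(S), 2))


def check_binary_constraints(Y, lower, upper, delta, seeds, conflicts, forcing):
    """Evaluate the one-hot, size, balance, seed, conflict and forcing conditions."""
    K = Y.shape[1]
    sizes = Y.sum(axis=0)                          # |C_k| = sum_i y_ik
    return {
        "one_hot": bool((Y.sum(axis=1) == 1).all()),
        "size": all(lower[k] <= sizes[k] <= upper[k] for k in range(K)),
        "balance": all((sizes[k] - sizes[q]) ** 2 <= delta ** 2
                       for k, q in combinations(range(K), 2)),
        "seed": all(sum(Y[i, k - 1] for k in allowed) == 1
                    for i, allowed in seeds.items()),
        "conflict": all(pair_distance_sum(Y, S) == len(S) * (len(S) - 1)
                        for S in conflicts),
        "forcing": all(pair_distance_sum(Y, S) == 0 for S in forcing),
    }


Y = one_hot([1, 1, 2, 2, 3, 3], K=3)
ok = check_binary_constraints(
    Y, lower=[1, 1, 1], upper=[3, 3, 3], delta=1,
    seeds={0: {1, 2}}, conflicts=[{0, 2, 4}], forcing=[{2, 3}],
)
bad = check_binary_constraints(
    Y, lower=[1, 1, 1], upper=[3, 3, 3], delta=1,
    seeds={0: {1, 2}}, conflicts=[{0, 1, 4}], forcing=[{2, 3}],
)
```

The summed conflict condition is sufficient because each pairwise term is at most 2 for one-hot rows, so the total can reach $|S|(|S| - 1)$ only when every pair of nodes in $S$ lies in different communities.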

The dimension of Problem 3 is $n \times K$.

It is seen that Problem 3 is a binary problem with a convex quadratic objective function and linear constraints, i.e. it is a quadratic binary problem. Also, we came to its CCD generalization having the form of Problem 3 with the additional constraints (26)-(30), further referred to as Problem 3.G. These constraints can be present in any combination.

Due to the presence of the quadratic constraints (27)-(30), Problem 3.G belongs to the class of quadratically constrained binary optimization problems, in which both the objective and the constraints are quadratic. Moreover, the constraints (29) and (30) are quadratic equality constraints, i.e. they are non-convex in contrast to the convex objective (22) and the convex inequality constraints (27). Thus, when attacking Problem 3.G, we deal with a non-convex binary optimization problem.

3. Discussion

Let us compare the models Problems 2 and 3 and their generalization, Problems 2.G and 3.G.

Comparing the dimensions of the models, the advantage remains with Problems 3 and 3.G, since the upper bound $K$ on the number of communities is normally much smaller than the number of nodes $n$; respectively,

$n^2 \gg n \cdot K$.

In contrast to Problem 3 and Problem 3.G, Problem 2 and its CCD generalization, Problem 2.G, are linear binary problems, but they are not completely formalized.

Accordingly, only the models Problem 3 and Problem 3.G coped with formalizing the CDP and CCDP as binary optimization problems. The obtained models have a quadratic objective representing modularity and linear or quadratic constraints. Therefore, the models can be directly solved by general nonlinear solvers, regardless of whether binary variables are supported.

The variables of the matrices $X$ and $Y$ have different meanings. Namely, the elements of $X$ reflect the relationship between two nodes, while the elements of $Y$ reflect the relationship between a node and a community. Supposedly, other constraints can be formalized by a combination of $X$- and $Y$-entries, thus covering a much wider class of CCDPs.
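One possible route from the constrained binary model to an unconstrained QUBO form is a quadratic penalty on the one-hot constraints. The sketch below is an illustration under stated assumptions and not the reduction used by the authors: it flattens $y_{ik}$ into a vector $z$, uses the energy $-\sum_{i,j} b_{ij} \sum_k y_{ik} y_{jk} + \lambda \sum_i (\sum_k y_{ik} - 1)^2$ with the constant term dropped, picks the penalty weight $\lambda = (K + 1)\sum_{i,j} |b_{ij}| + 1$ as a deliberately conservative choice, and finds the minimum by exhaustive search, which is feasible only for tiny instances. All function names are hypothetical.

```python
import itertools
import numpy as np


def modularity_matrix(A):
    """B with b_ij = a_ij - d_i*d_j/(2m)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    return A - np.outer(d, d) / A.sum()


def build_qubo(A, K, lam=None):
    """QUBO matrix over z[i*K + k] = y_ik; the energy is z^T Q z."""
    B = modularity_matrix(A)
    n = B.shape[0]
    if lam is None:
        lam = (K + 1) * np.abs(B).sum() + 1.0     # conservative penalty weight
    Q = np.zeros((n * K, n * K))
    for k in range(K):
        idx = np.arange(n) * K + k                # positions of y_0k, ..., y_(n-1)k
        Q[np.ix_(idx, idx)] -= B                  # objective: -sum_ij b_ij y_ik y_jk
    for i in range(n):
        for k in range(K):
            for q in range(K):
                # Expansion of lam * (sum_k y_ik - 1)^2 using y^2 = y.
                Q[i * K + k, i * K + q] += -lam if k == q else lam
    return Q


def brute_force_minimum(Q):
    """Exhaustive search over all binary vectors (tiny instances only)."""
    best_z, best_e = None, np.inf
    for bits in itertools.product((0, 1), repeat=Q.shape[0]):
        z = np.array(bits)
        e = float(z @ Q @ z)
        if e < best_e:
            best_z, best_e = z, e
    return best_z, best_e


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

K = 2
z_best, e_best = brute_force_minimum(build_qubo(A, K))
Y_best = z_best.reshape(6, K)
labels = Y_best.argmax(axis=1) + 1                # community label of each node
```

On an annealer, the exhaustive search would be replaced by the hardware or a sampler, and the additional CCD constraints would require further penalty terms, possibly with auxiliary variables for inequalities.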

4. Conclusion

This paper attacks a critical task in Network Analysis called Constrained Community Detection (CCD). Binary optimization was chosen as a modeling tool. A new approach to modeling these problems is presented, significantly expanding the set of formalized constraints. A comparison with the standard modeling approach is made, demonstrating the advantages of our approach. Both approaches can be combined, forming new CCD models in the form of polynomial binary optimization problems. Nonlinear solvers, including polynomial ones, can be used to solve them. The polynomial and binary nature of the models also allows their reduction to popular QUBO models, which are solved very effectively on quantum and digital annealers.

[1] Scott, Social Network Analysis, 4th ed., Sage Publications, 2017.

[2] M. E. V. Valkenburg, Network Analysis, 3rd ed., Pearson College Div, 1974.

[3] D. K. Sewell, Model-based edge clustering, Journal of Computational and Graphical Statistics 30 (2020) 390-405. doi:10.1080/10618600.2020.1811104.

[4] Farzad, Pichugina, Koliechkina, Multi-layer community detection, in: 2018 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), 2018, pp. 133-140. doi:10.1109/ICCAIRO.2018.00030.

[5] Pichugina, Farzad, A human communication network model, in: Proceedings of the 12th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer, volume 1614, CEUR, 2016, pp. 33-40. URL: https://ceur-ws.org/Vol-3403/paper21.pdf, ISSN 1613-0073.

[6] V. D. Blondel, J.-L. Guillaume, Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (2008) P10008. doi:10.1088/1742-5468/2008/10/P10008.

[7] Duch, Arenas, Community detection in complex networks using extremal optimization, Physical Review E 72 (2005) 027104. doi:10.1103/PhysRevE.72.027104.

[8] Eaton, Mansbach, A spin-glass model for semi-supervised community detection, Proceedings of the AAAI Conference on Artificial Intelligence 26 (2012) 900-906. doi:10.1609/aaai.v26i1.8320.

[9] Eguchi, T. Murata, Constrained community detection in multiplex networks, in: G. L. Ciampaglia, A. Mashhadi, T. Yasseri (Eds.), Social Informatics, Lecture Notes in Computer Science, Springer International Publishing, 2017, pp. 75-87. doi:10.1007/978-3-319-67217-5_6.

[10] C. F. A. Negre, Ushijima-Mwesigwa, S. M. Mniszewski, Detecting multiple communities using quantum annealing on the d-wave system, PLOS ONE 15 (2020) e0227538. doi:10.1371/journal.pone.0227538, publisher: Public Library of Science.

[11] M. E. J. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences 103 (2006) 8577-8582. doi:10.1073/pnas.0601602103.

[12] Wagenseller III, Wang, Size matters: A comparative analysis of community detection algorithms, 2017. URL: http://arxiv.org/abs/1712.01690.

[13] Clauset, M. E. J. Newman, Moore, Finding community structure in very large networks, Physical Review E 70 (2004) 066111. doi:10.1103/PhysRevE.70.066111.

[14] W. D. Viles, A. J. O'Malley, Constrained community detection in social networks, 2017. URL: http://arxiv.org/abs/1708.04354.

[15] Lancichinetti, Fortunato, Community detection algorithms: A comparative analysis, Physical Review E 80 (2009) 056117. doi:10.1103/PhysRevE.80.056117, publisher: American Physical Society.

[16] M. E. J. Newman, Communities, modules and large-scale structure in networks, Nature Physics 8 (2012) 25-31. doi:10.1038/nphys2162, publisher: Nature Publishing Group.

[17] V. A. Traag, Waltman, N. J. van Eck, From louvain to leiden: guaranteeing well-connected communities, Scientific Reports 9 (2019) 5233. doi:10.1038/s41598-019-41695-z, publisher: Nature Publishing Group.

[18] Girvan, M. E. J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences 99 (2002) 7821-7826. doi:10.1073/pnas.122653799, publisher: Proceedings of the National Academy of Sciences.