Preliminary Results on Mixed Integer Programming for Searching Maximum Quasi-Bicliques and Large Dense Biclusters Dmitry Ignatov1,3 (0000-0002-6584-8534) dignatov@hse.ru, Polina Ivanova1 (0000-0001-6010-7991) ivanova.p.m@gmail.com, Albina Zamaletdinova1 (0000-0002-4116-9633) aazamaletdinova 1@edu.hse.ru, and Oleg Prokopyev2,1 (0000-0003-2888-8630) droleg@pitt.edu 1 National Research University Higher School of Economics, Russia 2 University of Pittsburgh, USA 3 St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences, Russia Abstract. This short paper is related to the problem of finding maxi- mum quasi-bicliques in a bipartite graph (bigraph). A quasi-biclique in a bigraph is its “almost” complete subgraph; here, we assume that the subgraph is a quasi-biclique if it lacks γ · 100% of the edges to become a biclique. The problem of finding the maximal quasi-biclique(s) consists of finding subset(s) of vertices of an input bigraph such that the induced by these subsets subgraph is a quasi-biclique and its size is maximal. A model based on mixed integer programming (MIP) to search for a quasi-biclique is proposed and tested. Another its variant is tested that simultaneously maximizes both the size of the quasi-biclique and its den- sity, using the least-square criterion similar to the one exploited by Tri- Box method for tricluster generation. Therefore, the output patterns can be called large dense biclusters as well. Keywords: quasi-biclique, maximal quasi-biclique, mixed integer pro- gramming, biclustering, triclustering 1 Introduction There are many data sources that can be represented as bipartite graphs. For example, social network data, where the underlying binary relation represents interactions between people and communities, advertisement data with a set of manufacturers and corresponding set of products etc. Copyright c 2019 for this paper by its author. Copying permitted for private and academic purposes. 2 D.I. Ignatov et al. In this study, we are interested in analysis of such bipartite data and search for largest dense communities, where almost all elements are connected. A situ- ation where all elements of a community are actually involved can be described by a biclique (complete subgraph) of a bipartite graph. Unfortunately, the community completeness requirement excludes almost complete communities frequently met in real-world data. Due to this reason we allow some edges to be absent and introduce the concept of quasi-biclique. In order to bound the size of quasi-biclique we can use the subgraph minimal density or the maximum number of absent edges needed to complete a subgraph. The problem of searching for maximal quasi-clique as well as the problem of searching for maximal clique is NP-hard. Many algorithms that solve the problem are being developed. For instance, Pattillo et al. [1] offered an integer programming model for searching for maximal quasi-clique but the case of bi- partite graph with biclique was not studied in details. Note that quasi-bicliques and dense biclusters can be considered as relaxed versions of formal concept definition [2, 3]. The aim of this paper is to propose a mixed integer programming models for finding a maximum quasi-bicliques in a bipartite graph and compare those models with existing ones. The remainder of the paper is organised as follows. Section 2 proposes two MIP models for quasi-biclique search. In Section 3, we provide the preliminary experimental results. Section 4 concludes the paper. 2 Quasi-biclique searching models Definition 1. A γ-quasi-biclique in a bipartite graph G = (U, V, E) is its bipar- tite induced subgraph G[V 0 , U 0 ] with U 0 ⊆ U , V 0 ⊆ V and its number of edges |E 0 | ≥ γ|V 0 | · |U 0 |, where γ ∈ (0, 1]. The problem of maximum quasi-biclique in a bipartite graph G(U, V, E) with fixed γ: 0 < γ ≤ 1 is to find U 0 ⊆ U and V 0 ⊆ V such that vertex-induced subgraph G[U 0 , V 0 ] is a γ-quasi-biclique of size |U 0 | + |V 0 |, maximum for this graph. Lets denote a maximum γ – quasi-biclique in the graph G by ωγ (G) Model 1. Here, we adapt model F3 from [4] for searching for maximum quasi- bicliques. We introduce the following variables: ui = 1 ⇔ i ∈ U 0 , vj = 1 ⇔ j ∈ (1) (2) V 0 , yij = 1 ⇔ ∃(i, j) ∈ E ∩ (U 0 × V 0 ), zk = 1 ⇔ |U 0 | = k, zk = 1 ⇔ |V 0 | = k, (1) (1) (2) (2) ωl , ωu are the lower and upper bounds for vertex set U 0 , ωl , ωu are the 0 lower and upper bounds for vertex set V . Searching for Maximum Quasi-Bicliques 3 Then we build Model 1.   X X ωγ (G) = max  ui + vj  , (1) u,v,y,z i∈U j∈V (1) (2) ωu ωu X X X under conditions yij ≥ γ n · m · zn(1) · zm (2) , (2) (i,j)∈E (1) (2) n=ωl m=ωl yij ≤ ui , yij ≤ vj , yij ≥ vi + vj − 1, ∀i ∈ U, ∀j ∈ V, (3) (1) (2) ωu ωu X X X X ui = nzn(1) , vj = (2) mzm , (4) i∈U (1) n=ωl j∈V (2) m=ωl (1) (2) ωu ωu X X zn(1) = 1, (2) zm = 1, (5) (1) (2) n=ωl m=ωl ui ∈ {0, 1}, vj ∈ {0, 1} ∀i ∈ U, ∀j ∈ V, yij ∈ {0, 1} ∀i < j, (i, j) ∈ E, (6) (1) (2) zn(1) ≥ 0 ∀n ∈ {ωl , . . . , ωu(1) }, zm (2) ≥ 0 ∀m ∈ {ωl , . . . , ωu(2) }. (7) (1) (2) As in the model F3 we can bound zk and zk and recast them into contin- uous variables. Remark. In Model 1, condition 2 is not linear, so we linearize it further. In the worst-case scenario for dense graphs there are |U | + |V | binary variables and |E| + |U | · |V | continuous variables to be optimized. Model 2. Let us have a look at a different maximizing criteria for our MIP model from [5] dedicated to triclustering generation. By narrowing this criteria to binary setting, it is possible to produce another maximising criteria for model GF3(f) from [4]. For a bipartite graph G(U, V, E) and its subgraph G[C1 , C2 ] the function f is maximized over the density and the size of G[C1 , C2 ]. 2 (|{(i, j) : i ∈ C1 , j ∈ C2 , (i, j) ∈ E}|) f (C1 , C2 ) = ρ2 (G[C1 , C2 ]) · |C1 | · |C2 | = . |C1 | · |C2 | (8) Using variables definitions from the previous model we can apply logarithm and rewrite f as follows:   !   X X X flog (C1 , C2 ) = 2 · log  yij  − log ui − log  v j . (9) (i,j)∈E i∈U j∈V Without any extra bounds on vertex sets of a quasi-biclique and the minimum number of edges in it, the model has 2 · (|U | + |V | + |E|) variables. 4 D.I. Ignatov et al. 3 Experimental Validation The greedy algorithm of finding of maximal γ-quasi-bicliques in a bipartite graph was written in Python 2.7. The MIP models were implemented with the opti- misation package IBM Cplex. All calculations were carried out on a laptop with the MacOs operating system, 2.7 GHz Intel Core i5 processor, RAM 8 GB 1867 MHz. Datasets for testing the performance of the algorithms are taken from [6, 7]: 1. Divorse in US: 9 × 50 vertices, 225 edges; 2. Dutch Elite: 3810 × 937 vertices, 5221 edges; 3. Dutch Elite (TOP-200): 200 × 395 vertices, 877 edges; 4. Movie-Lens (ml-latest-small): 9125 × 20 vertices, 20340 edges. The results of applying the algorithms are presented in the Table 1 for γ = 0.6. For each algorithm the main parameters are indicated: the algorithm running time (time)4 , the number of founded maximum quasi-bicliques (count) and the maximum size of the founded solution. If one of the maximal quasi-biclique components has cardinality 1, this is marked in the table as (U 0 , V 0 ), where U 0 and V 0 are the sizes of the fractions. Table 1: Results of maximum γ-quasi-biclique search. Parameters: γ = 0.6. Model 1 Model 2 Greedy algorithm [8] Data time count size time count size time count size Divorse in US 1.23 s 1 54 3.38 s 1 53 360ms 1 (37, 1) DutchElite (top200) 7602 s 2 (26,1) 181 s 1 14 3s 1 13 DutchElite - - - 3265 s 1 22 1954 s 1 21 Movie-Lens (small) 28068 2 694 5042 s 1 712 2591 s 1 681 4 Results and conclusions One can note, that MIP models work an order of magnitude slower than the [8] algorithm, but they find more quasi-cliques and generally each has larger size. If we consider the results not in terms of speed, but in terms of quality, then Model 2 on the examples of data worked best. This model produced more unique and larger quasi-bicliques than other algorithms (however, only sizes of the maximum quasi-bicliques are mentioned in the tables). 4 Dashes (”-”) in the following tables mean that the algorithm worked too long and did not find a solution. Searching for Maximum Quasi-Bicliques 5 Acknowledgement. The work of Dmitry Ignatov shown in all the sections has been supported by the Russian Science Foundation grant no. 17-11-01276 and performed at St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences, Russia. References 1. Pattillo, J., Veremyev, A., Butenko, S., Boginski, V.: On the maximum quasi-clique problem. Discrete Applied Mathematics 161(1) (2013) 244 – 257 2. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J.: Concept-based biclustering for internet advertisement. In: 12th IEEE International Conference on Data Mining Workshops, ICDM Workshops, Brussels, Belgium, December 10, 2012. (2012) 123–130 3. Kaytoue, M., Kuznetsov, S.O., Macko, J., Napoli, A.: Biclustering meets triadic concept analysis. Ann. Math. Artif. Intell. 70(1-2) (2014) 55–79 4. Veremyev, A.: Exact mip-based approaches for finding maximum quasi-cliques and dense subgraphs. Computational Optimization and Applications 64(1) (May 2016) 177–214 5. Ignatov, D.I., Gnatyshak, D.V., Kuznetsov, S.O., Mirkin, B.G.: Triadic formal concept analysis and triclustering: searching for optimal patterns. Machine Learning 101(1) (Oct 2015) 271–302 6. Borgatti, S.P., Everett, M.G., Freeman, L.C.: UCINET. In: Encyclopedia of Social Network Analysis and Mining. (2014) 2261–2267 7. Batagelj, V., Mrvar, A.: Pajek. In: Encyclopedia of Social Network Analysis and Mining. (2014) 1245–1256 8. Liu, X., Li, J., Wang, L.: Quasi-bicliques: Complexity and binding pairs. In Hu, X., Wang, J., eds.: Computing and Combinatorics, Berlin, Heidelberg, Springer Berlin Heidelberg (2008) 255–264