=Paper=
{{Paper
|id=Vol-1623/papercpr6
|storemode=property
|title=The p-median Problem with Order for Two-Source Clustering
|pdfUrl=https://ceur-ws.org/Vol-1623/papercpr6.pdf
|volume=Vol-1623
|authors=Xenia Klimentova, Anton Ushakov, Igor Vasilyev
|dblpUrl=https://dblp.org/rec/conf/door/KlimentovaUV16
}}
==The p-median Problem with Order for Two-Source Clustering==
The p-median Problem with Order for Two-source Clustering⋆ Xenia Klimentova1 , Anton V. Ushakov2 , and Igor Vasilyev2 1 INESC TEC, Campus da FEUP, Rua Dr. Roberto Frias, 378, 4200-465 Porto, Portugal, xenia.klimentova@inesctec.pt 2 Matrosov Institute for System Dynamics and Control Theory, Lermontov 134, 664033, Irkutsk, Russia, {aushakov, vil}@icc.ru. Abstract. In this paper we present a hybrid approach to integrative cluster- ing based on the p-median problem with clients’ preferences. We formulate the problem of simultaneous clustering of a set of objects, characterized by two sets of features, as a bi-level p-median model. An exact approach involving a branch- and-cut method combined with the simulated annealing algorithm is used, that allows one to find a two-source clustering. The proposed approach is compared with some well-known mathematical optimisation based clustering techniques applied to the NCI-60 tumour cell line anticancer drug screen dataset. The re- sults obtained demonstrate the applicability of our approach to find competitive integrative clusterings. 1 Introduction Recent advances in microarray technologies give rise to collecting huge amount of high- dimensional omic data, generated simultaneously for the same biological samples. A huge number of techniques and approaches has been proposed to carefully combine and analyse continuous, discrete and categorical multisource data, mainly based on probabilistic and statistical modelling. Recent reviews [17, 20] cover a wide range of approaches to omic data integration. Modern techniques in statistics for integrative analyses of cancer data, including incorporating multiple heterogeneous genomic data types, are also reviewed in [19, 27]. This paper focuses on the cluster analysis, which is one of the main unsupervised learning methods. Clustering is an exploratory tool that consists in dividing a set of objects (biological samples, observations) into subsets (clusters) containing similar ob- jects, while objects from different clusters are dissimilar. Despite a large number of general clustering algorithms having been proposed, there is a lack of such methods for ⋆ The research of A. Ushakov and I. Vasilyev is supported by the Russian Science Foun- dation, research project No. 15-11-20015, the work of X. Klimentova is supported by the FCT — Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project SFRH/BPD/101134/2014. Copyright ⃝ c by the paper’s authors. Copying permitted for private and academic purposes. In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org The p-median Problem with Order for Two-source Clustering 537 efficient integrative clustering of multisource data. The simplest and obvious approach to integrative clustering consists in concatenating the feature vectors of the consid- ered objects or samples followed by applying some well-known clustering algorithms to analyse the data obtained. Though this approach may be useful in some rare instances, it is rather inflexible due to its impossibility of capturing important features that are specific to each dataset and in the case of heterogeneous data. We propose an approach to integrative clustering of two-source data which is based on bi-level integer linear programming and metaheuristic optimization. As opposed to previously developed integrative clustering methods, which are based on modelling each dataset using mixture models or post hoc combining multiple base clusterings, we propose a hybrid integer programming approach based on the bi-level p-median problem with clients’ preferences. The approach is compared to those presented in [10], indicating that our approach can provide clustering solutions competitive with other distance-based integrative clustering algorithms. One of the first gene-drug integrative clustering approach was proposed in [21] and was based on a hierarchical clustering algorithm. Other mathematical optimization based approaches that have been applied to simultaneously clustering gene expression and drug activity profiles are the Soft Topographic Vector Quantization [8], relational k-means [11], genetic programming [3], and a consensus p-median approach [10]. The paper is structured as follows. The generic p-median clustering model is de- scribed in section 2. In section 3 an extension to the bi-level problem is presented. Finally, section 4 illustrates the comparison of different approaches and gives some concluding remarks. 2 The p-median clustering problem The p-median problem, proposed by [13], consists in choosing p sites for locating facili- ties from the set of potential places in order to minimize the sum of weighted distances between customers and open facilities. This model can naturally be formulated in the form of the following combinatorial optimization problem: {∑ } min min dij : |S| = p , S⊆I i∈S j∈J where |I| = {1, . . . , m} is a set of potential facility sites, J = {1, . . . , l} is a set of customers, and dij — the distance (or the cost of satisfying the demand) between customer j ∈ J and the facility located at site i ∈ I. The p-median problem is known to be NP-hard in the strong sense and now sup- posed to be well-studied. Apart from a number of applications, the p-median problem is a powerful tool for the cluster analysis. To transform the problem into a clustering tool, let us assume that I = J, i.e. the potential sites and customer locations coincide. In this case the p-median problem (also referred as the minimum-sum-of-stars clustering [16]), can be formulated on a simple digraph G(I, A) with the node set I corresponding to the set of samples to be clustered and the arc set A = {(i, j) : i, j ∈ I; i ̸= j}, each arc (i, j) ∈ A has a weight (distance) dij > 0 measuring the dissimilarity between each pair of samples. Now the p-median problem is to select p nodes, those are called medians, in 538 X. Klimentova, A. Ushakov, I. Vasilyev order to minimize the sum of the distances between each node and its nearest median. Any feasible solution to the problem consists of p stars with medians in the centres. Each star is a cluster and the median is a cluster representative (prototype). The p-median model has several important advantages over centroid-based para- metric clustering approaches, like k-means and its variations. First of all, modern hy- brid heuristics and exact approaches are able to find optimal (global) or suboptimal solutions to the p-median instances of huge size (e.g. see [4, 12, 15]), while k-means or k-means type algorithms, like PAM, CLARA, CLARANS, k-medians, are heuristics that converge fast to a local optimum. Secondly, the classical k-means algorithm pre- supposes using the squared Euclidean distance as the measure of similarity, which not always results in a good clustering. For other types of parametric clustering problems, several algorithms using k-means type iterative relocation scheme have been proposed, e.g. k-medians in the case of Manhattan distance or the Linde-Buzo-Gray algorithm when the Itakura-Saito distance is considered [6]. The p-median problem can be formulated as an integer linear program. For each node i ∈ I, let us consider a binary variable yi , which takes the value 1 if node i is a median (sample i is a cluster representative), and takes the value 0 otherwise. For each arc (i, j) ∈ A let xij be the binary variable which is equal to 1 if node i is the closest median to node j (sample j is assigned to cluster i), and takes the value 0 otherwise. Let also δ − (j) = {i ∈ I| (i, j) ∈ A} be a set of nodes of the graph G assigned to j with outgoing arcs, and let δ + (i) = {j ∈ I| (i, j) ∈ A} be a set of nodes assigned to i with the arcs leaving node i. Thus, the p-median problem can be written as follows [5]: ∑ min dij xij , (1) (x,y) (i,j)∈A ∑ xij + yj = 1, j ∈ I, (2) i∈δ − (j) xij ≤ yi , i ∈ I, j ∈ δ + (i), (3) ∑ yi = p, (4) i∈I yi ∈ {0, 1}, i ∈ I, (5) xij ∈ {0, 1}, (i, j) ∈ A. (6) Objective function (1) minimizes the overall total sum of distances between all nodes and the closest medians. Constraints (2) guarantee that either node j is a median or it is assigned to a median. Constraints (3) ensure that each node can only be assigned to medians. Constraint (4) enforces the number of medians to be p. Several studies considered the p-median model for clustering various types of data, e.g. psychological data [18] or gene expression profiles and drug responses [10]. More- over, p-median model is proved to be a powerful clustering tool that provides high qual- ity clustering solutions outperforming those provided by CLARA and CLARANS [15] or k-means and k-means++ [23]. The p-median Problem with Order for Two-source Clustering 539 3 Bi-level p-median clustering Suppose that we are given a set I of objects (cell lines) characterized by two feature vectors, i.e. representing gene expression and drug activity profiles. Here we propose an approach to two-source integrative clustering based on a bi-level version of the p-median problem and study its application to clustering cancer cell lines. Let two matrices dij ≥ 0 and gij ≥ 0 indicate the dissimilarity of a pair of cell lines (i, j) ∈ A = {(i, j) : i ∈ I, j ∈ I, i ̸= j} in the drug and gene space respectively. Then, let us consider the following bi-level p-median clustering problem: ∑ min dij x∗ij (y), (7) y (i,j)∈A ∑ yi = p, (8) i∈I yi ∈ {0, 1}, i ∈ I, (9) where x∗ij (y) is the optimal solution to the following lower-level problem: ∑ min gij xij , (10) x (i,j)∈A ∑ xij + yj = 1, j ∈ I, (11) i∈δ − (j) xij ≤ yi , i ∈ I, j ∈ δ + (i), (12) xij ∈ {0, 1}, (i, j) ∈ A. (13) On the upper level on seeks for p cell lines to be the cluster representatives such that the sum of dissimilarities of drug activity profiles between cell lines and its closest representatives is minimised. On the lower level all cell lines are assigned to the cluster representatives, selected at the first level, in order to minimise the sum of dissimilarities of gene expression profiles between cell lines belonging to the same cluster and its closest representatives. In other words, the decision about which cell lines will be the medians (cluster representatives) is made on the first level according to the matrix {dij }, while assigning remaining cell lines to clusters is performed on the second level taking into account the dissimilarity matrix {gij }, (i, j) ∈ A. Thus, if gij ≤ gkj and both i, k ∈ I are cluster representatives, then cell line j is assigned to cluster Ci . Note that when all the columns of the lower-level matrix {gij } are sorted in ascend- ing order or when dij = gij , the problem (7)–(13) is a particular case of the p-median problem. Thus, it is NP-hard in the strong sense and does not belong to the class APX [2]. In general case the solution x∗ij (y) may be not unique, then one has to specify what kind of solution is supposed to be optimal to the clustering problem (7)–(13). Two extreme cases, i.e. cooperative and non-cooperative decision-making strategies, are often emphasised [1], depending on whether cell lines on the lower level are as- signed to their closest representatives in order to respectively minimise or maximise the value of the upper-level objective. To avoid these cases for the sake of simplicity, we suppose that all elements of a column j of the matrix {gij } are distinct [24]. 540 X. Klimentova, A. Ushakov, I. Vasilyev The bi-level p-median clustering problem can be reduced to a single-level integer linear program [14, 7]. Let us denote Wij = {k ∈ I : gij < gkj }, (i, j) ∈ A, then the single-level problem is ∑ min dij xij , (14) (x,y) (i,j)∈A ∑ yi + xkj ≤ 1, i ∈ I, (i, j) ∈ A, (15) k∈Wij ∑ yi = p, (16) i∈I ∑ xij + yj = 1, j ∈ I, (17) i∈δ − (j) xij ≤ yi , i ∈ I, j ∈ δ + (i), (18) yi ∈ {0, 1}, i ∈ I, (19) xij ∈ {0, 1}, (i, j) ∈ A. (20) This formulation is identical to the p-median except constraints (15), which guarantee that if i ∈ I is a cluster representative, then a cell line j ∈ I is not assigned to more dissimilar (according to gene expression profiles) representatives from the set Wij . Thus, xij , (i, j) ∈ A is the optimal solution to the lower-level problem for any yi , i ∈ I. Note that one can similarly consider the bi-level p-median clustering problem with the matrices {gij } and {dij } on the upper and lower levels respectively. To find the optimal solution to the bi-level p-median clustering problem, we have developed an exact approach including a branch-and-cut algorithm and a metaheuristic to search for initial upper bounds of the optimal value, which is detailed in [24–26]. The method is based on the family facet-defining inequalities proposed in [26] for the simple- plant location problem with order. For some node j and subset S ⊆ I let us denote by bj (S) ∈ S the nearest node from S for j, i.e. Wij ⊂ Wbj (S)j for all i ∈ S \ {bj (S)}. Theorem 1 For all i, u, v ∈ I the inequalities ∑ ∑ xku + xkv + yt ≤ 1, (21) k∈Wiu t k∈Uiu where t = bv (I \Wiu ) and Uiu t = I \(Wiu ∪{t}), are valid for the polytope of the problem (14)-(20). The cutting plane method for this family of inequalities was implemented using two computational tricks for reduction of the number of violated inequalities. On each iteration we add the inequalities corresponding to the most distant hyperplane from the current fractional solution, preventing the almost parallel inequalities from adding. To find an upper bound of the optimal solution the standard scheme of simulated annealing method was implemented, using flip-neighbourhood (see [26] for more detail). Finally, as an exact method we have used Cut-and-Branch scheme, one of the effec- tive methods tested in previous works for a similar problem. In root node of branching The p-median Problem with Order for Two-source Clustering 541 tree the cutting plane method is run and the new formulation obtained, as well as an upper bound provided by the simulated annealing, are used in order to solve the problem with branch-and-bound algorithm. 4 Clustering analysis of the NCI-60 dataset The proposed approach was implemented as a program using C++ programming lan- guage. The MIP solver FICO Xpress callable library has been used as a branch-and-cut framework. We compare our results with those obtained in the paper [10] as well as with other integrative clustering techniques presented in that paper. The number of clusters p was equal to 9. As a measure of dissimilarity between cell lines both in the drug and gene spaces, we apply one of the most widely used measure based on the Pearson correlation coefficient, i.e. distij = 1 − corrij , (i, j) ∈ A [9]. We report our results performing two-source cluster analysis of National Cancer Institute (NCI)-60 panel of human tumour cancer cell lines [22]. The dataset consists of 60 cell lines from 9 cancer tissues. To compare our approach, we use the same dataset as was previously considered in [10, 21]. It includes 1376 gene expression profiles and 1400 drug activity patterns. Thus, I is a set of m = 60 cell lines, and the dissimilarity matrices {gij } and {dij } are computed for each pair (i, j) ∈ A as distij , using the given gene expression and drug response data. To evaluate the quality of a cluster solution we use the average Pearson correlation coefficient ∑p 2 ∑ P = corrij , m(|Ck | − 1) k=1 i,j∈{1,...,m}:i