=Paper= {{Paper |id=Vol-1623/papercpr6 |storemode=property |title=The p-median Problem with Order for Two-Source Clustering |pdfUrl=https://ceur-ws.org/Vol-1623/papercpr6.pdf |volume=Vol-1623 |authors=Xenia Klimentova, Anton Ushakov, Igor Vasilyev |dblpUrl=https://dblp.org/rec/conf/door/KlimentovaUV16 }} ==The p-median Problem with Order for Two-Source Clustering== https://ceur-ws.org/Vol-1623/papercpr6.pdf
                 The p-median Problem with Order
                    for Two-source Clustering⋆

                Xenia Klimentova1 , Anton V. Ushakov2 , and Igor Vasilyev2
 1
     INESC TEC, Campus da FEUP, Rua Dr. Roberto Frias, 378, 4200-465 Porto, Portugal,
                           xenia.klimentova@inesctec.pt
             2
               Matrosov Institute for System Dynamics and Control Theory,
                       Lermontov 134, 664033, Irkutsk, Russia,
                               {aushakov, vil}@icc.ru.



         Abstract. In this paper we present a hybrid approach to integrative cluster-
         ing based on the p-median problem with clients’ preferences. We formulate the
         problem of simultaneous clustering of a set of objects, characterized by two sets
         of features, as a bi-level p-median model. An exact approach involving a branch-
         and-cut method combined with the simulated annealing algorithm is used, that
         allows one to find a two-source clustering. The proposed approach is compared
         with some well-known mathematical optimisation based clustering techniques
         applied to the NCI-60 tumour cell line anticancer drug screen dataset. The re-
         sults obtained demonstrate the applicability of our approach to find competitive
         integrative clusterings.


1      Introduction

Recent advances in microarray technologies give rise to collecting huge amount of high-
dimensional omic data, generated simultaneously for the same biological samples. A
huge number of techniques and approaches has been proposed to carefully combine
and analyse continuous, discrete and categorical multisource data, mainly based on
probabilistic and statistical modelling. Recent reviews [17, 20] cover a wide range of
approaches to omic data integration. Modern techniques in statistics for integrative
analyses of cancer data, including incorporating multiple heterogeneous genomic data
types, are also reviewed in [19, 27].
    This paper focuses on the cluster analysis, which is one of the main unsupervised
learning methods. Clustering is an exploratory tool that consists in dividing a set of
objects (biological samples, observations) into subsets (clusters) containing similar ob-
jects, while objects from different clusters are dissimilar. Despite a large number of
general clustering algorithms having been proposed, there is a lack of such methods for
⋆
     The research of A. Ushakov and I. Vasilyev is supported by the Russian Science Foun-
     dation, research project No. 15-11-20015, the work of X. Klimentova is supported by the
     FCT — Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and
     Technology) within project SFRH/BPD/101134/2014.
Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org
                       The p-median Problem with Order for Two-source Clustering           537

efficient integrative clustering of multisource data. The simplest and obvious approach
to integrative clustering consists in concatenating the feature vectors of the consid-
ered objects or samples followed by applying some well-known clustering algorithms to
analyse the data obtained. Though this approach may be useful in some rare instances,
it is rather inflexible due to its impossibility of capturing important features that are
specific to each dataset and in the case of heterogeneous data.
     We propose an approach to integrative clustering of two-source data which is based
on bi-level integer linear programming and metaheuristic optimization. As opposed
to previously developed integrative clustering methods, which are based on modelling
each dataset using mixture models or post hoc combining multiple base clusterings,
we propose a hybrid integer programming approach based on the bi-level p-median
problem with clients’ preferences. The approach is compared to those presented in [10],
indicating that our approach can provide clustering solutions competitive with other
distance-based integrative clustering algorithms. One of the first gene-drug integrative
clustering approach was proposed in [21] and was based on a hierarchical clustering
algorithm. Other mathematical optimization based approaches that have been applied
to simultaneously clustering gene expression and drug activity profiles are the Soft
Topographic Vector Quantization [8], relational k-means [11], genetic programming [3],
and a consensus p-median approach [10].
     The paper is structured as follows. The generic p-median clustering model is de-
scribed in section 2. In section 3 an extension to the bi-level problem is presented.
Finally, section 4 illustrates the comparison of different approaches and gives some
concluding remarks.


2    The p-median clustering problem
The p-median problem, proposed by [13], consists in choosing p sites for locating facili-
ties from the set of potential places in order to minimize the sum of weighted distances
between customers and open facilities. This model can naturally be formulated in the
form of the following combinatorial optimization problem:
                                   {∑                     }
                               min       min dij : |S| = p ,
                                S⊆I         i∈S
                                      j∈J

where |I| = {1, . . . , m} is a set of potential facility sites, J = {1, . . . , l} is a set of
customers, and dij — the distance (or the cost of satisfying the demand) between
customer j ∈ J and the facility located at site i ∈ I.
     The p-median problem is known to be NP-hard in the strong sense and now sup-
posed to be well-studied. Apart from a number of applications, the p-median problem is
a powerful tool for the cluster analysis. To transform the problem into a clustering tool,
let us assume that I = J, i.e. the potential sites and customer locations coincide. In this
case the p-median problem (also referred as the minimum-sum-of-stars clustering [16]),
can be formulated on a simple digraph G(I, A) with the node set I corresponding to
the set of samples to be clustered and the arc set A = {(i, j) : i, j ∈ I; i ̸= j}, each arc
(i, j) ∈ A has a weight (distance) dij > 0 measuring the dissimilarity between each pair
of samples. Now the p-median problem is to select p nodes, those are called medians, in
538    X. Klimentova, A. Ushakov, I. Vasilyev

order to minimize the sum of the distances between each node and its nearest median.
Any feasible solution to the problem consists of p stars with medians in the centres.
Each star is a cluster and the median is a cluster representative (prototype).
    The p-median model has several important advantages over centroid-based para-
metric clustering approaches, like k-means and its variations. First of all, modern hy-
brid heuristics and exact approaches are able to find optimal (global) or suboptimal
solutions to the p-median instances of huge size (e.g. see [4, 12, 15]), while k-means or
k-means type algorithms, like PAM, CLARA, CLARANS, k-medians, are heuristics
that converge fast to a local optimum. Secondly, the classical k-means algorithm pre-
supposes using the squared Euclidean distance as the measure of similarity, which not
always results in a good clustering. For other types of parametric clustering problems,
several algorithms using k-means type iterative relocation scheme have been proposed,
e.g. k-medians in the case of Manhattan distance or the Linde-Buzo-Gray algorithm
when the Itakura-Saito distance is considered [6].
   The p-median problem can be formulated as an integer linear program. For each
node i ∈ I, let us consider a binary variable yi , which takes the value 1 if node i is a
median (sample i is a cluster representative), and takes the value 0 otherwise. For each
arc (i, j) ∈ A let xij be the binary variable which is equal to 1 if node i is the closest
median to node j (sample j is assigned to cluster i), and takes the value 0 otherwise.
Let also δ − (j) = {i ∈ I| (i, j) ∈ A} be a set of nodes of the graph G assigned to j with
outgoing arcs, and let δ + (i) = {j ∈ I| (i, j) ∈ A} be a set of nodes assigned to i with
the arcs leaving node i. Thus, the p-median problem can be written as follows [5]:

                       ∑
              min                 dij xij ,                                           (1)
              (x,y)
                      (i,j)∈A
                        ∑
                                  xij + yj = 1,            j ∈ I,                     (2)
                      i∈δ − (j)

                    xij ≤ yi ,                             i ∈ I, j ∈ δ + (i),        (3)
                    ∑
                        yi = p,                                                       (4)
                      i∈I
                    yi ∈ {0, 1},                           i ∈ I,                     (5)
                    xij ∈ {0, 1},                          (i, j) ∈ A.                (6)


Objective function (1) minimizes the overall total sum of distances between all nodes
and the closest medians. Constraints (2) guarantee that either node j is a median or
it is assigned to a median. Constraints (3) ensure that each node can only be assigned
to medians. Constraint (4) enforces the number of medians to be p.
    Several studies considered the p-median model for clustering various types of data,
e.g. psychological data [18] or gene expression profiles and drug responses [10]. More-
over, p-median model is proved to be a powerful clustering tool that provides high qual-
ity clustering solutions outperforming those provided by CLARA and CLARANS [15]
or k-means and k-means++ [23].
                        The p-median Problem with Order for Two-source Clustering        539

3    Bi-level p-median clustering
Suppose that we are given a set I of objects (cell lines) characterized by two feature
vectors, i.e. representing gene expression and drug activity profiles. Here we propose
an approach to two-source integrative clustering based on a bi-level version of the
p-median problem and study its application to clustering cancer cell lines. Let two
matrices dij ≥ 0 and gij ≥ 0 indicate the dissimilarity of a pair of cell lines (i, j) ∈
A = {(i, j) : i ∈ I, j ∈ I, i ̸= j} in the drug and gene space respectively. Then, let us
consider the following bi-level p-median clustering problem:
                            ∑
                      min         dij x∗ij (y),                                       (7)
                        y
                              (i,j)∈A
                              ∑
                                    yi = p,                                              (8)
                              i∈I
                            yi ∈ {0, 1},                         i ∈ I,                  (9)

where x∗ij (y) is the optimal solution to the following lower-level problem:
                      ∑
                min        gij xij ,                                                    (10)
                x
                    (i,j)∈A
                      ∑
                                xij + yj = 1,               j ∈ I,                      (11)
                    i∈δ − (j)

                    xij ≤ yi ,                              i ∈ I, j ∈ δ + (i),         (12)
                    xij ∈ {0, 1},                           (i, j) ∈ A.                 (13)

On the upper level on seeks for p cell lines to be the cluster representatives such that
the sum of dissimilarities of drug activity profiles between cell lines and its closest
representatives is minimised. On the lower level all cell lines are assigned to the cluster
representatives, selected at the first level, in order to minimise the sum of dissimilarities
of gene expression profiles between cell lines belonging to the same cluster and its closest
representatives. In other words, the decision about which cell lines will be the medians
(cluster representatives) is made on the first level according to the matrix {dij }, while
assigning remaining cell lines to clusters is performed on the second level taking into
account the dissimilarity matrix {gij }, (i, j) ∈ A. Thus, if gij ≤ gkj and both i, k ∈ I
are cluster representatives, then cell line j is assigned to cluster Ci .
    Note that when all the columns of the lower-level matrix {gij } are sorted in ascend-
ing order or when dij = gij , the problem (7)–(13) is a particular case of the p-median
problem. Thus, it is NP-hard in the strong sense and does not belong to the class
APX [2]. In general case the solution x∗ij (y) may be not unique, then one has to specify
what kind of solution is supposed to be optimal to the clustering problem (7)–(13).
Two extreme cases, i.e. cooperative and non-cooperative decision-making strategies,
are often emphasised [1], depending on whether cell lines on the lower level are as-
signed to their closest representatives in order to respectively minimise or maximise
the value of the upper-level objective. To avoid these cases for the sake of simplicity,
we suppose that all elements of a column j of the matrix {gij } are distinct [24].
540     X. Klimentova, A. Ushakov, I. Vasilyev

    The bi-level p-median clustering problem can be reduced to a single-level integer
linear program [14, 7]. Let us denote Wij = {k ∈ I : gij < gkj }, (i, j) ∈ A, then the
single-level problem is
                    ∑
               min       dij xij ,                                                (14)
               (x,y)
                       (i,j)∈A
                              ∑
                   yi +               xkj ≤ 1,              i ∈ I, (i, j) ∈ A,          (15)
                             k∈Wij
                       ∑
                             yi = p,                                                    (16)
                       i∈I
                         ∑
                                   xij + yj = 1,            j ∈ I,                      (17)
                       i∈δ − (j)

                   xij ≤ yi ,                               i ∈ I, j ∈ δ + (i),         (18)
                   yi ∈ {0, 1},                             i ∈ I,                      (19)
                   xij ∈ {0, 1},                            (i, j) ∈ A.                 (20)

This formulation is identical to the p-median except constraints (15), which guarantee
that if i ∈ I is a cluster representative, then a cell line j ∈ I is not assigned to more
dissimilar (according to gene expression profiles) representatives from the set Wij .
Thus, xij , (i, j) ∈ A is the optimal solution to the lower-level problem for any yi , i ∈ I.
Note that one can similarly consider the bi-level p-median clustering problem with the
matrices {gij } and {dij } on the upper and lower levels respectively.
    To find the optimal solution to the bi-level p-median clustering problem, we have
developed an exact approach including a branch-and-cut algorithm and a metaheuristic
to search for initial upper bounds of the optimal value, which is detailed in [24–26]. The
method is based on the family facet-defining inequalities proposed in [26] for the simple-
plant location problem with order. For some node j and subset S ⊆ I let us denote by
bj (S) ∈ S the nearest node from S for j, i.e. Wij ⊂ Wbj (S)j for all i ∈ S \ {bj (S)}.
Theorem 1 For all i, u, v ∈ I the inequalities
                           ∑           ∑
                                xku +        xkv + yt ≤ 1,                              (21)
                                     k∈Wiu            t
                                                   k∈Uiu


where t = bv (I \Wiu ) and Uiu
                            t
                               = I \(Wiu ∪{t}), are valid for the polytope of the problem
(14)-(20).
The cutting plane method for this family of inequalities was implemented using two
computational tricks for reduction of the number of violated inequalities. On each
iteration we add the inequalities corresponding to the most distant hyperplane from the
current fractional solution, preventing the almost parallel inequalities from adding. To
find an upper bound of the optimal solution the standard scheme of simulated annealing
method was implemented, using flip-neighbourhood (see [26] for more detail).
    Finally, as an exact method we have used Cut-and-Branch scheme, one of the effec-
tive methods tested in previous works for a similar problem. In root node of branching
                       The p-median Problem with Order for Two-source Clustering         541

tree the cutting plane method is run and the new formulation obtained, as well as
an upper bound provided by the simulated annealing, are used in order to solve the
problem with branch-and-bound algorithm.

4    Clustering analysis of the NCI-60 dataset
The proposed approach was implemented as a program using C++ programming lan-
guage. The MIP solver FICO Xpress callable library has been used as a branch-and-cut
framework. We compare our results with those obtained in the paper [10] as well as
with other integrative clustering techniques presented in that paper. The number of
clusters p was equal to 9.
     As a measure of dissimilarity between cell lines both in the drug and gene spaces, we
apply one of the most widely used measure based on the Pearson correlation coefficient,
i.e. distij = 1 − corrij , (i, j) ∈ A [9].
     We report our results performing two-source cluster analysis of National Cancer
Institute (NCI)-60 panel of human tumour cancer cell lines [22]. The dataset consists
of 60 cell lines from 9 cancer tissues. To compare our approach, we use the same dataset
as was previously considered in [10, 21]. It includes 1376 gene expression profiles and
1400 drug activity patterns. Thus, I is a set of m = 60 cell lines, and the dissimilarity
matrices {gij } and {dij } are computed for each pair (i, j) ∈ A as distij , using the given
gene expression and drug response data.
     To evaluate the quality of a cluster solution we use the average Pearson correlation
coefficient
                                ∑p
                                          2         ∑
                         P =                                  corrij ,
                                    m(|Ck | − 1)
                             k=1              i,j∈{1,...,m}:i