Algorithms with Performance Guarantee for a Weighted 2-partition Problem Alexander Kel’manov Anna Motkova Sobolev Institute of Mathematics Sobolev Institute of Mathematics Acad. Koptyug avenue 4, Acad. Koptyug avenue 4, Novosibirsk State University Novosibirsk State University Pirogova str. 1, Pirogova str. 1, 630090 Novosibirsk, Russia 630090 Novosibirsk, Russia kelm@math.nsc.ru anitamo@mail.ru Abstract We consider the problem of partitioning a finite set of Euclidean points into two clusters minimizing the sum over both clusters the weighted sums of the squared intracluster distances from the elements of the clusters to their centers. The center of one of the clusters is unknown and determined as the average value over all points in the cluster, while the center of the other cluster is the origin. The weight factors for both intracluster sums are the cardinalities of the corresponding clusters. In this work, we present a short survey on the results for this problem and a new result: a 2-approximation algorithm. 1 Introduction In this work, we consider a strongly NP-hard problem of discrete optimization. The goal of our work is to present a short survey on the results for these problems. The problem under study is closely related to the well-known strongly NP-hard Weighted variance-based 2-clustering Problem Given a set Y = {y1 , . . . , yN } of points from Rq . Find a partition of Y into two non-empty clusters C and Y \ C such that ∑ ∑ |C| ∥y − y(C)∥2 + |Y \ C| ∥y − y(Y \ C)∥2 −→ min , y∈C y∈Y\C ∑ ∑ where y(C) = |C| y∈C y and y(Y \ C) = |C| y∈Y\C y are the geometric centers (centroids) of both clusters C 1 1 and Y \ C. The main difference between the problem under study and this problem is that in the considered problem the center of one of the clusters is fixed in the origin (without loss of generality). It is obvious, that the considered problem and the Weighted variance-based 2-clustering problem are not equivalent and so both of them need an individual study. The importance of the problem for applications motivates to continue researches, e.g. importance for geometric problems, approximation problems, statistical problems of joint evaluations and Copyright ⃝ c by the paper’s authors. Copying permitted for private and academic purposes. In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org 304 testing hypotheses by nonuniform samples, problems of Data clustering, Data mining, Machine learning, Big data, applied problems in technical and medical diagnostics, etc. 2 Cardinality-weighted Variance-based 2-clustering with Given Center Problem 1. Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer M . Find a partition of Y into two non-empty clusters C and Y \ C such that ∑ ∑ F1 (C) = |C| ∥y − y(C)∥2 + |Y \ C| ∥y∥2 −→ min , (1) y∈C y∈Y\C ∑ 1 where y(C) = |C| y is the geometric center (centroid) of C and such that |C| = M . y∈C Problem 1 is strongly NP-hard [Kel’manov & Pyatkin, 2015, Kel’manov & Pyatkin, 2016]. Therefore, by [Garey & Johnson, 1979], there are neither exact polynomial-time nor exact pseudopolynomial-time algorithms for this problem, unless P=NP. Despite of this, a pseudopolynomial algorithm for the special case of Problem 1 exists. It proposed in [Kel’manov & Motkova, 2016a] and finds an optimal solution in the case of integer components of the points in the input set and fixed space dimension. Algorithm A1 (exact pseudopolynomial algorithm). Input: A set Y and some positive integer M ≤ N . Step 1. Construct G — a multidimensional cubic uniform in each coordinate grid of size 2D with the distance 1 M between the nodes and the center at the origin, and D is the maximal modulus of the coordinates of input points. Step 2. For each node x ∈ G, calculate g x (y) = (2M − N )∥y∥2 − 2M ⟨y, x⟩ , y ∈ Y; (2) find the subset B x which consists of M points of the set Y, at which the function g x (y) has the smallest values. Calculate F1 (Bx ) by (1). Step 3. In the family F1 (B x ), x ∈ G constructed at Step 2, find the node xA1 = arg min F1 (Bx ) and the x∈G subset B xA1 . Put CA1 = B xA1 . Output: The set CA1 . Theorem 1. Assume that the elements of Y have integer values from the interval [−D, D]. Then algorithm A1 finds an optimal solution of Problem 1 in O(qN (2M D + 1)q ) time. In the case of fixed space dimension q the algorithm is pseudopolynomial. In [Kel’manov & Pyatkin, 2015, Kel’manov & Pyatkin, 2016] it is also shown that for Problems 1 there does not exist any fully polynomial time approximation scheme (FPTAS), unless P=NP. In [Kel’manov & Motkova, 2016b] was proposed such approximation scheme for a special case of the Problem 1 when the space dimension q is fixed. Algorithm A2 (approximation scheme). Input: a set Y and numbers M and ε. For each point y ∈ Y Steps 1–6 are executed. Step 1. Compute the values g y (z), z ∈ Y, using formula (2); find a subset By ⊆ Y with M smallest values g (z), compute F (B y ) using formula (1). y Step 2. If F (B y ) = 0, then put CA3 = B y ; exit. √ 1 √ 1 2ε Step 3. Compute H = H(y) = M F1 (B y ) and h = h(y, ε) = M y q F1 (B ). Step 4. Construct the lattice G(y, h, H + h/2) using formula G(y, h, H + h/2) = {d ∈ Rq | d = y + h · (i1 , . . . , iq ), ik ∈ Z, |hik | ≤ H + h/2, k ∈ {1, . . . , q}}. Step 5. For each node x of the lattice G(y, h, H + h/2) compute the values g x (y), y ∈ Y, using formula (2) and find a subset Bx ⊆ Y with M smallest values g x (y). Compute F1 (B x ) using formula (1), remember this value and the set B x . Step 6. If F (B x ) = 0, then put CA2 = B x ; exit. 305 Step 7. In the family {B x | x ∈ G(y, h, H + h/2), y ∈ Y} of candidate sets that have been constructed in Steps 1–6, choose as a solution CA2 the set B x for which F1 (B x ) is minimal. Output: the set CA2 . Theorem ( (√ 2. For )q )any fixed ε > 0 algorithm A2 finds a (1 + ε)-approximate solution of Problem 1 in O qN 2 2q ε +2 time. Corollary 1. In the case when the dimension q of space is bounded by a constant value, this algorithm is an FPTAS. Later in [Kel’manov et al., 2017] was proposed the similar approximation scheme with the same time com- plexity for the generalization of Problem 1 in which the weight factors are given as input (positive real numbers). Also there was(presented an improved algorithm that allows to find (1 + ε)-approximate solution of generalized √ 2 ( πe )q/2 (√ 2 )q ) problem in O qN 2 ε +2 time. In the case when the dimension q of space is bounded by a constant value, this algorithm is also an FPTAS. In addition, improved algorithm remains polynomial even when the dimension q of the space is bounded by the value C log N , where C is the positive constant. It is clear that 1 in this case algorithm implements a PTAS with time complexity N O(log ε ) . The last proposed algorithm for Problem 1 until this moment is a 2-approximation algorithm. It was proposed in [Kel’manov & Motkova, 2017] and allows us to find an approximate solution in the polynomial time. Algorithm A3 (2-approximation algorithm). Input: N -elements set Y ⊂ Rq , natural number M ≤ N. For each point y ∈ Y Steps 1–2 are executed. Step 1. Compute the values g y (z), z ∈ Y, using formula (2); find an M -elements subset B y ⊆ Y with the smallest values g y (z), compute F1 (By ) using formula (1). Step 2. If F1 (By ) = 0, then put CA2 = B y ; exit. Step 3. In the family {By |, y ∈ Y} of candidate sets that have been constructed in Steps 1–2, choose as a solution CA3 the set B x , for which F1 (Bx ) is minimal. Output: the set CA3 . Two examples of an input set (of 1000 points) and admissible solutions (i.e. 300-elements subset By ) found by Algorithm A3 at step 1 are presented at Fig.1 (a) and (b). (a) (b) Fig. 1. Let us substuntiate this algorithm below. 306 The following two lemmas are well known. (see, for example, [Kel’manov & Romanchenko, 2012, Kel’manov & Romanchenko, 2014]). ∑ Lemma 1. For an arbitrary point x ∈ Rq , a finite set Z ⊂ Rq and z = |Z|1 z∈Z z (z is the centroid of Z), it is true that ∑ ∑ ∥z − x∥2 = ∥z − z∥2 + |Z| · ∥x − z∥2 . z∈Z z∈Z Lemma 2. For a finite set Z ⊂ R , if a point u ∈ Rq is closer (in terms of distance) to the centroid z of Z q than any point in Z, then ∑ ∑ ∥z − u∥2 ≤ 2 ∥z − z∥2 . z∈Z z∈Z Lemma 3. Let ∑ ∑ S(C, x) = |C| ∥y − x∥2 + |Y \ C| ∥y∥2 , C ⊆ Y, x ∈ Rq . y∈C y∈Y\C Then the next statements are true: ∑ nonempty fixed set C ⊆ Y the minimum of the function S(C, x) over x ∈ R is reached at the point q (1) for any 1 y(C) = |C| y; y∈C (2) if |C| = M = const, then for any fixed point x ∈ Rq the minimum of function S(C, x) over C ⊆ Y is reached at the subset Bx that consists of M points of the set Y, at which the function (2) has the smallest values. Proof. The first statement follows from Lemma 2 and the definition of the functions S and F . Since |Y| = N and |C| = M , the second statement follows from the next chain of equalities: ∑ ∑ ∑ ∑ S(C, x) = M ∥y − x∥2 + (N − M ) ∥y∥2 = M ∥y∥2 − 2M ⟨y, x⟩ + M 2 ∥x∥2 + y∈C y∈Y\C y∈C y∈C ∑ ∑ ∑{ } ∑ (N − M ) ( ∥y∥ − 2 ∥y∥ ) = 2 (2M − N )∥y∥2 − 2M ⟨y, x⟩ + M 2 ∥x∥2 + (N − M ) ∥y∥2 . y∈Y y∈C y∈C y∈Y It remains to note that in the last two equalities the last two addends do not depend on C. Lemma is proved. ( ) Theorem 3. An Algorithm A3 finds a 2-approximate solution of Problem 1 in time O qN 2 . Proof. Let C ∗ be an optimal subset and t = arg min∗ ||y − y(C ∗ )||2 be a point from a subset C ∗ that is the y∈C closest to its centroid. Step-by-step, the algorithm goes through every points y of the set Y. Since t ∈ Y, a permissible solution also will be formed for the point t: the subset B t , defined by Lemma 3 (if x = t). Besides, a value F (B t ) of objective function for the Problem 1 will be calculated. For this value we have this inequalities from definitions of the functions F and S and Lemma 3 and a definition of the point t: F (B t ) = S y(B ) (B t ) ≤ S t (Bt ) ≤ S t (C ∗ ). t (3) Applying Lemma 2 to the set Z = C ∗ and the point u = t, we have ∑ ∑ ||y − t||2 ≤ 2 ||y − y(C ∗ )||2 . y∈C ∗ y∈C ∗ Using this inequality and definition (2), we find an estimation for a right part of (3) ∑ ∑ S t (C ∗ ) = M ||y − t||2 + (N − M ) ||y||2 y∈C ∗ y∈Y\C ∗ ∑ ∑ ≤ 2M ||y − y(C ∗ )||2 + (N − M ) ||y||2 ≤ 2F (C ∗ ). (4) y∈C ∗ y∈Y\C ∗ Combining (3) and (4), we have F (Bt ) ≤ 2F (C ∗ ). (5) 307 From Step 3 we know that the algorithm finds a solution of Problem 1 in the next form: CA = arg min F (B y ). (6) y∈Y Since Bt ∈ {B y |y ∈ Y}, from (6) we have an inequality F (CA ) ≤ F (B t ). (7) Finally, (5) and (7) implies an estimate F (CA ) ≤ 2F (C ∗ ). (8) Let us consider the case when at Step 2 of the algorithm the condition F (B y ) = 0 is executed for some input point y ∈ Y. For every subset C ⊆ Y inequality F (C) ≥ 0 is correct, so the subset CA = By ⊆ Y is an optimal solution of Problem 1. Inequality (8) is correct for this solution too. It means that a subset CA is a 2-approximation solution of Problem 1. Let us estimate the time complexity of the algorithm. At Step 1 we need no more than O(qN ) operations to calculate values g y (z). Searching of M the smallest elements in the set of N elements requires O(N ) operations (for example, using an algorithm of finding n-th smallest value in an unordered array [Wirth, 1976]). Step 2 need constant time. So for any point y ∈ Y total execution time of Steps 1 and 2 is O(qN ). As Steps 1 and 2 are executed N times, total time complexity of this steps is O(qN 2 ). Time complexity of Step 3 is estimated by value O(N ). So, the time complexity of the algorithm is O(qN 2 ). Theorem 3 is proved. Two examples of an input set (of 1000 points) and 2-approximate solutions found by Algorithm A3 are presented at Fig.2 (a) (i.e. 400-elements subset CA ) and (b) (i.e. 600-elements subset CA ). (a) (b) Fig. 2. 3 Conclusion In this work, we presented a short survey on the results for the problem of 2-partitioning of points in Euclidean space into two clusters. Also we presented the new result: a 2-approximation algorithm. It seems important to continue studying and the issue of great interest is the substantiation of randomized algorithms for this problem with linear or sub-linear time complexity. 308 Acknowledgements The research for algorithms A1 and A2 was supported by the Russian Foundation for Basic Research, Projects 16-31-00186, 16-07-00168. Research for algorithm A3 was supported by the Russian Science Foundation, Project 16-11-10041. References [Kel’manov & Pyatkin, 2015] Kel’manov, A. V., & Pyatkin, A. V. (2015). NP-Hardness of Some Quadratic Eu- clidean 2-Clustering Problems. Dokl.Akad.Nauk, 464(5), 535–538. [Kel’manov & Pyatkin, 2016] Kel’manov, A. V., & Pyatkin, A. V. (2016). On the Complexity of Some Quadratic Euclidean 2-Clustering Problems. Comput. Math. Math. Phys., 56(3), 491–497. [Garey & Johnson, 1979] Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: Freeman. [Kel’manov & Motkova, 2016a] Kel’manov, A. V., & Motkova, A. V. (2016). Exact Pseudopolynomial Algorithms for a Balanced 2-Clustering Problem. J. of Appl. and Ind. Math., 10(3), 349–355. [Kel’manov & Motkova, 2016b] Kel’manov, A. V., & Motkova, A. V. (2016). A Fully Polynomial-Time Approxi- mation Scheme for a Special Case of a Balanced 2-Clustering Problem. LNCS, 9869, 182–192. [Kel’manov et al., 2017] Kel’manov, A. V., & Motkova, A. V., & Shenmaier, V. V. (2017). Approximation Schemes for Some Quadratic Problems of Weighted 2-Partitioning a Set of Points. (in Russian) Tr. Inst. Mat. Mekh., 23(3), 159–170. [Kel’manov & Motkova, 2017] Kel’manov, A. V., & Motkova, A. V. (2017). An Approximation Polynomial-Time Algorithm for a Weighted 2-Clustering Problem with Restriction on Clusters Cardinalities. (in Russian) Comput. Math. Math. Phys. (accepted) [Kel’manov & Romanchenko, 2012] Kel’manov, A. V., & Romanchenko, S. M. (2012). An Approximation Algo- rithm for Solving a Problem of Search for a Vector Subset. J. Appl. Ind. Math., 6(1), 90–96. [Kel’manov & Romanchenko, 2014] Kel’manov, A. V., & Romanchenko, S. M. (2014). An FPTAS for a Vector Subset Search Problem. J. Appl. Indust. Math., 8(3), 329–336. [Wirth, 1976] Wirth, N. (1976). Algorithms + Data Structures = Programs. New Jersey: Prentice Hall. 309