=Paper=
{{Paper
|id=Vol-1987/paper43
|storemode=property
|title=Algorithms with Performance Guarantee for Some Quadratic Euclidean Problems of 2-Partitioning a Set and a Sequence
|pdfUrl=https://ceur-ws.org/Vol-1987/paper43.pdf
|volume=Vol-1987
|authors=Alexander Kel'manov,Vladimir Khandeev
}}
==Algorithms with Performance Guarantee for Some Quadratic Euclidean Problems of 2-Partitioning a Set and a Sequence==
Alexander Kel'manov and Vladimir Khandeev
Sobolev Institute of Mathematics, Acad. Koptyug avenue 4; Novosibirsk State University, Pirogova str. 1; 630090 Novosibirsk, Russia
kelm@math.nsc.ru, khandeev@math.nsc.ru

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V. U. Malkova, M. A. Posypkin (eds.): Proceedings of the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org

Abstract

We consider the problems of 2-partitioning a finite set and a finite sequence of points in Euclidean space into clusters (subsets or subsequences) minimizing the sum of squared distances between cluster elements and the corresponding cluster centers. It is assumed that the center of one of the desired clusters is the origin, while the center of the other cluster is unknown and is determined as the mean value over the cluster elements. In this work, we present a short survey of some results for these problems and a new result: a randomized algorithm for the sequence partitioning problem.

1 Introduction

In this work, we consider two strongly NP-hard discrete optimization problems. The goal of our work is to present a short survey of some recent results for these problems.

The problems under consideration are closely related to the well-known strongly NP-hard 2-MSSC (Minimum Sum-of-Squares Clustering) problem [Aloise et al., 2009]. This problem (also known as 2-Means) is to find a partition of a finite set Y of points from a Euclidean space into two subsets (clusters) C and Y\C such that

  Σ_{y∈C} ∥y − ȳ(C)∥² + Σ_{y∈Y\C} ∥y − ȳ(Y\C)∥² → min,

where ȳ(C) = (1/|C|) Σ_{y∈C} y and ȳ(Y\C) = (1/|Y\C|) Σ_{y∈Y\C} y are the centroids (the geometric centers) of the sets C and Y\C, respectively.

An example of an input set of points on the plane is presented in Fig. 1. One needs to find a partition of this set into two clusters and find their centroids. It is assumed that the points are scattered around these centroids (the points of different clusters are colored differently).

Figure 1: Example of the input set

The main difference between the considered problems and the 2-MSSC problem is that the center of one of the clusters is fixed at a certain point of the Euclidean space (see the following sections); without loss of generality we can assume that this point is the origin. Additionally, in one of the problems under consideration the input is a sequence rather than a set, and there are constraints on the indices of the elements of the subsequences that are included in one cluster or the other. Despite the similarities, the considered problems and the 2-MSSC problem are not equivalent; therefore, both problems need an individual study. The problems under consideration are also important for applications, e.g., approximation theory, statistics, data mining, remote sensing, geophysics, biometrics, medicine and technical diagnostics, speech signal processing, electronic intelligence, radiolocation, hydroacoustics, etc.
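To make the 2-MSSC objective above concrete, here is a minimal sketch (our own illustration, not part of the surveyed papers; the NumPy representation and the function name are assumptions) that evaluates the criterion for a given partition of Y.

```python
import numpy as np

def mssc_objective(Y, in_C):
    """Evaluate the 2-MSSC objective for a partition of Y.

    Y    : (N, q) array of points.
    in_C : boolean mask of length N; True marks points assigned to cluster C.
    Both clusters are assumed non-empty; each is compared with its own centroid.
    """
    cost = 0.0
    for cluster in (Y[in_C], Y[~in_C]):
        centroid = cluster.mean(axis=0)            # geometric center of the cluster
        cost += ((cluster - centroid) ** 2).sum()  # sum of squared distances to it
    return cost

# Toy usage: two well-separated groups on the plane.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
mask = np.arange(len(Y)) < 10
print(mssc_objective(Y, mask))
```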
2 Partitioning a Set

Problem 1. Given: a set Y = {y1, ..., yN} of points from R^q and a positive integer M.
Find: a subset C ⊆ Y that minimizes the objective function

  S(C) = Σ_{y∈C} ∥y − ȳ(C)∥² + Σ_{y∈Y\C} ∥y∥²,

where ȳ(C) = (1/|C|) Σ_{y∈C} y is the centroid (the geometric center) of the set C, under the constraint |C| = M.

Figure 2: Example of the input set

An example of an input set of points on the plane is presented in Fig. 2. One needs to find a partition of this set into two clusters. Unlike the 2-MSSC problem, in Problem 1 one needs to find the centroid of only one cluster. It is assumed that the points of this cluster (shown in dark color) are scattered around its unknown centroid and that the points of the second cluster (shown in white color) are scattered around the origin.

The strong NP-hardness of Problem 1 is shown in [Kel'manov & Pyatkin, 2009]. In the case of fixed space dimension the problem is solvable in polynomial time [Gimadi et al., 2010, Shenmaier, 2016]. A 2-approximation algorithm for Problem 1 is presented in [Dolgushev & Kel'manov, 2011]; the running time of the algorithm is O(qN²). A polynomial-time approximation scheme for Problem 1 with O(qN^{2/ε+1} (9/ε)^{3/ε}) time complexity, where ε is an arbitrary relative error, is substantiated in [Dolgushev et al., 2015].

In [Kel'manov & Khandeev, 2015a], a randomized algorithm for Problem 1 is presented. For an established parameter value, the algorithm finds an approximate solution of the problem in polynomial time for given values of the relative error and failure probability; conditions are established under which the algorithm is asymptotically exact and runs in polynomial time.

Algorithm A1 (randomized algorithm).
Input: Y, M, and a positive integer parameter k.
Step 1. Generate a multiset T by independently and randomly choosing k elements one after another (with replacement) from Y.
Step 2. For each nonempty H ⊆ T, compute the centroid ȳ(H) and form the subset B_M(ȳ(H)) consisting of the M points from Y having the largest projections onto the direction specified by this centroid. Compute S(B_M(ȳ(H))).
Step 3. In the family of solutions found at Step 2, choose the subset C_A = B_M(ȳ(H)) for which S(B_M(ȳ(H))) is minimal. If there are several solutions with the same minimal value, take an arbitrary one of them.
Output: C_A.

Theorem 1. For arbitrary δ ∈ (0, 1) and positive integer t ≤ k, algorithm A1 finds a (1 + 1/(δt))-approximate solution of Problem 1 in O(2^k q(k + N)) time with probability not less than 1 − (δ + α), where α = Σ_{i=0}^{t−1} C(k, i) (M/N)^i (1 − M/N)^{k−i} and C(k, i) denotes the binomial coefficient.

Corollary 1. Assume that M ≥ βN, where β ∈ (0, 1) is a constant. Then, given ε > 0 and γ ∈ (0, 1), for the fixed parameter k = max(⌈(2/β)⌈2/(γε)⌉⌉, ⌈(8/β) ln(2/γ)⌉) algorithm A1 finds a (1 + ε)-approximate solution of Problem 1 with probability not less than 1 − γ in O(qN) time.

Below are the conditions for algorithm A1 to be asymptotically exact.

Theorem 2. Under the conditions of Theorem 1, let k = ⌈log₂ N⌉, δ = (log₂ N)^{−1/2}, t = ⌈kM/(2N)⌉. Assume that M ≥ βN, where β ∈ (0, 1) is a constant. Then algorithm A1 finds a (1 + ε_N)-approximate solution of Problem 1 with probability 1 − γ_N in O(qN²) time, where ε_N ≤ (2/β)(log₂ N)^{−1/2} and γ_N ≤ (log₂ N)^{−1/2} + N^{−β/(8 ln 2)}.
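The following sketch illustrates algorithm A1 under our own assumptions about the data layout (points as rows of a NumPy array) and with helper names chosen for illustration. It enumerates all nonempty sub-multisets of T literally, so the running time grows as 2^k and only small values of k are practical.

```python
import numpy as np
from itertools import combinations

def best_m_by_projection(Y, x, M):
    # B_M(x): indices of the M points of Y with the largest projections
    # onto the direction specified by x (ranking by <y, x> is equivalent).
    return np.argsort(Y @ x)[-M:]

def objective_S(Y, idx_C):
    # S(C) = sum ||y - centroid(C)||^2 over C + sum ||y||^2 over Y \ C.
    mask = np.zeros(len(Y), dtype=bool)
    mask[idx_C] = True
    C = Y[mask]
    centroid = C.mean(axis=0)
    return ((C - centroid) ** 2).sum() + (Y[~mask] ** 2).sum()

def algorithm_A1(Y, M, k, seed=None):
    rng = np.random.default_rng(seed)
    # Step 1: multiset T of k points drawn with replacement.
    T = Y[rng.integers(0, len(Y), size=k)]
    best_idx, best_val = None, np.inf
    # Step 2: try the centroid of every nonempty sub-multiset H of T.
    for r in range(1, k + 1):
        for H in combinations(range(k), r):
            center = T[list(H)].mean(axis=0)
            idx = best_m_by_projection(Y, center, M)
            val = objective_S(Y, idx)
            # Step 3: keep the candidate with the smallest objective value.
            if val < best_val:
                best_idx, best_val = idx, val
    return best_idx, best_val
```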
A pseudopolynomial algorithm which finds an optimal solution in the case of integer components of the points of the input set and fixed space dimension is proposed in [Kel'manov & Khandeev, 2015b].

Algorithm A2 (exact pseudopolynomial algorithm).
Input: Y, M.
Step 1. For each point x ∈ G, where G is a multidimensional cubic grid, uniform in each coordinate, of size 2D with distance 1/M between the nodes and center at the origin, and D is the maximal modulus of the coordinates of the input points, construct the subset B_M(x) consisting of the M points from Y having the largest projections onto the direction specified by x. Calculate Q(B_M(x), x) = Σ_{y∈B_M(x)} ∥y − x∥² + Σ_{y∈Y\B_M(x)} ∥y∥².
Step 2. Find the point x_A = arg min_{x∈G} Q(B_M(x), x) (if there are several points with the same minimal value, take an arbitrary one of them) and the corresponding subset B_M(x_A). As a solution to the problem, take C_A = B_M(x_A).
Output: C_A.

Theorem 3. Suppose that, under the conditions of Problem 1, the points of Y have integer components lying in [−D, D]. Then algorithm A2 finds an optimal solution in O(qN(2MD + 1)^q) time.

In the case of fixed space dimension q the algorithm is pseudopolynomial, since under this condition the running time of the algorithm is estimated as O(N(MD)^q).

In [Kel'manov & Khandeev, 2016], it was proved that, unless P=NP, in the general case of Problem 1 there is no fully polynomial-time approximation scheme (FPTAS). In addition, such a scheme was presented for the case of fixed space dimension.

Algorithm A3 (approximation scheme).
Input: Y, M, and ε > 0.
For each point y ∈ Y, Steps 1–5 are executed.
Step 1. Construct the set B_M(y) of the M points from Y having the largest projections onto the direction specified by y.
Step 2. Compute S(B_M(y)), h(y, ε) = √( (2ε/(qM)) S(B_M(y)) ), and H(y) = √( S(B_M(y))/M ).
Step 3. If S(B_M(y)) = 0, then return the set B_M(y) as the result C_A produced by the algorithm and exit; otherwise, go to the next step.
Step 4. Construct a multidimensional cubic grid G(y, h, H + h/2), uniform in each coordinate, of size 2H + h with distance h between the nodes and center at the point y.
Step 5. For each node x of the grid G(y, h, H + h/2), construct the set B_M(x) of the M points from Y having the largest projections onto the direction specified by x, and compute S(B_M(x)).
Step 6. In the family {B_M(x) | x ∈ G(y, h, H + h/2), y ∈ Y} of sets, choose, as a solution C_A, the set B_M(x) for which S(B_M(x)) is minimal. If there are several solutions with the same minimal value, take an arbitrary one of them.
Output: C_A.

Theorem 4. For any fixed ε > 0 algorithm A3 finds a (1 + ε)-approximate solution of Problem 1 in O(qN²(√(2q/ε) + 2)^q) time.

In the case of fixed space dimension q the algorithm implements an FPTAS, since under this condition the running time of the algorithm is estimated as O(N²(1/ε)^{q/2}).

3 Partitioning a Sequence

Problem 2. Given: a sequence Y = (y1, ..., yN) of points from R^q and positive integers Tmin, Tmax and M.
Find: a subset M = {n1, ..., nM} ⊆ N = {1, ..., N} such that

  F(M) = Σ_{j∈M} ∥y_j − ȳ(M)∥² + Σ_{i∈N\M} ∥y_i∥² → min,

where ȳ(M) = (1/|M|) Σ_{i∈M} y_i, under the constraints

  1 ≤ Tmin ≤ n_m − n_{m−1} ≤ Tmax ≤ N,  m = 2, ..., M,     (1)

on the elements of the tuple (n1, ..., nM).

In this problem, the inequalities (1) define the constraints on the elements of the cluster {y_i | i ∈ M} with unknown centroid. In [Kel'manov & Pyatkin, 2013], the variant of Problem 2 with Tmin and Tmax as parameters was studied, and it was established that Problem 2 is strongly NP-hard for every Tmin < Tmax. In the trivial case when Tmin = Tmax, this problem is solvable in polynomial time. A 2-approximation polynomial algorithm with complexity O(N²(MN + q)) was proposed in [Kel'manov & Khamidullin, 2014].
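As an illustration of Problem 2 (again our own sketch; the 0-based indexing and the helper names are assumptions), the fragment below checks the constraints (1) for a candidate index set and evaluates the objective F(M).

```python
import numpy as np

def satisfies_constraints(indices, T_min, T_max):
    # Constraints (1): consecutive chosen indices differ by at least T_min
    # and at most T_max; only the gaps matter, so 0-based indices are fine.
    gaps = np.diff(np.sort(indices))
    return bool(np.all((gaps >= T_min) & (gaps <= T_max)))

def objective_F(Y, indices):
    # F(M) = sum ||y_j - centroid||^2 over chosen j + sum ||y_i||^2 over the rest.
    mask = np.zeros(len(Y), dtype=bool)
    mask[indices] = True
    chosen = Y[mask]
    centroid = chosen.mean(axis=0)
    return ((chosen - centroid) ** 2).sum() + (Y[~mask] ** 2).sum()

# Toy usage with a sequence of 8 planar points and M = {0, 2, 5} (0-based).
Y = np.arange(16, dtype=float).reshape(8, 2)
M_idx = np.array([0, 2, 5])
print(satisfies_constraints(M_idx, T_min=2, T_max=3), objective_F(Y, M_idx))
```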
For the case where the components of the points are integers and the dimension q is fixed, an exact pseudopolynomial algorithm was introduced in [Kel'manov et al., 2017a].

Algorithm A4 (exact pseudopolynomial algorithm).
Input: Y, Tmin, Tmax, M.
Step 1. For each point x ∈ G, where G is a multidimensional cubic grid, uniform in each coordinate, of size 2D with distance 1/M between the nodes and center at the origin, and D is the maximal modulus of the coordinates of the input points, find the optimal solution M_x of the problem

  G_x(M) = Σ_{n∈M} (2⟨y_n, x⟩ − ∥x∥²) → max     (2)

under constraints (1) on the elements of M. Compute G_x(M_x).
Step 2. Find the point x_A = arg max_{x∈G} G_x(M_x) (if there are several points with the same maximal value, take an arbitrary one of them) and the corresponding subset M_A = M_{x_A}.
Output: M_A.

At Step 1 the optimal solution of problem (2) is found in O(N(M(Tmax − Tmin + 1) + q)) time using the dynamic programming scheme of [Kel'manov & Khamidullin, 2014].

Theorem 5. Suppose that, under the conditions of Problem 2, the points of Y have integer components lying in [−D, D]. Then algorithm A4 finds an optimal solution in O(N(M(Tmax − Tmin + 1) + q)(2MD + 1)^q) time.

In the case of fixed space dimension q the algorithm is pseudopolynomial, since under this condition the running time of the algorithm is estimated as O(N³(MD)^q).

In [Kel'manov et al., 2016], an approximation algorithm was proposed which, for fixed dimension q of the space, implements an FPTAS.

Algorithm A5 (approximation scheme).
Input: Y, Tmin, Tmax, M, and ε > 0.
For each point y ∈ Y, Steps 1–5 are executed.
Step 1. Find [Kel'manov & Khamidullin, 2014] an optimal solution M_y of problem (2) for x = y.
Step 2. Compute F(M_y), h = √( (2ε/(qM)) F(M_y) ), and H = √( F(M_y)/M ).
Step 3. If F(M_y) = 0, then return the set M_y as the result M_A produced by the algorithm and exit; otherwise, go to the next step.
Step 4. Construct a multidimensional cubic grid G(y, h, H + h/2), uniform in each coordinate, of size 2H + h with distance h between the nodes and center at the point y.
Step 5. For each node x of the grid G(y, h, H + h/2), construct [Kel'manov & Khamidullin, 2014] an optimal solution M_x of problem (2) and compute F(M_x).
Step 6. In the family {M_x | x ∈ G(y, h, H + h/2), y ∈ Y} of sets, choose, as a solution M_A, the set M_x for which F(M_x) is minimal. If there are several solutions with the same minimal value, take an arbitrary one of them.
Output: M_A.

Theorem 6. For any fixed ε > 0 algorithm A5 finds a (1 + ε)-approximate solution of Problem 2 in O(N²(M(Tmax − Tmin + 1) + q)(√(2q/ε) + 2)^q) time.

In the case of fixed space dimension q the algorithm implements an FPTAS, since under this condition the running time of the algorithm is estimated as O(MN³(1/ε)^{q/2}).

In [Kel'manov et al., 2017b], a randomized algorithm for Problem 2 is presented.

Algorithm A6 (randomized algorithm).
Input: Y, Tmin, Tmax, M, and a positive integer parameter k.
Step 1. Generate a multiset T by independently and randomly choosing k elements one after another (with replacement) from Y.
Step 2. For each nonempty H ⊆ T, compute the centroid ȳ(H) and construct [Kel'manov & Khamidullin, 2014] an optimal solution M_x of problem (2) for x = ȳ(H).
Step 3. In the family of solutions found at Step 2, choose a set M_A = M_x for which F(M_x) is minimal. If there are several solutions with the same minimal value, take an arbitrary one of them.
Output: M_A.
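Algorithms A4–A6 all rely on solving problem (2) for a fixed point x by the dynamic programming scheme of [Kel'manov & Khamidullin, 2014]. The sketch below is our own illustrative reconstruction of such a scheme, not the authors' code; the 0-based indexing and all names are assumptions. It fills a table dp[m][n] with the best value of G_x over choices of m indices whose last element is n, restricting consecutive indices to differ by Tmin..Tmax, and then recovers an optimal index set. Its running time is O(N(M(Tmax − Tmin + 1) + q)), matching the bound cited above.

```python
import numpy as np

def solve_problem_2_for_x(Y, x, M, T_min, T_max):
    """Maximize G_x(M) = sum_{n in M} (2<y_n, x> - ||x||^2) over index sets
    {n_1 < ... < n_M} with T_min <= n_m - n_{m-1} <= T_max (0-based indices).
    Returns (best value, chosen indices), or (None, None) if infeasible."""
    N = len(Y)
    g = 2.0 * (Y @ x) - float(x @ x)       # gain of choosing index n
    NEG = -np.inf
    dp = np.full((M + 1, N), NEG)
    parent = np.full((M + 1, N), -1, dtype=int)
    dp[1, :] = g                            # any single index is feasible
    for m in range(2, M + 1):
        for n in range(N):
            # The previous index j must satisfy T_min <= n - j <= T_max.
            for j in range(max(n - T_max, 0), n - T_min + 1):
                if dp[m - 1, j] > NEG and dp[m - 1, j] + g[n] > dp[m, n]:
                    dp[m, n] = dp[m - 1, j] + g[n]
                    parent[m, n] = j
    if not np.isfinite(dp[M]).any():
        return None, None                   # no index set satisfies (1)
    n = int(np.argmax(dp[M]))
    best = dp[M, n]
    # Recover the chosen indices by walking the parent links backwards.
    indices = [n]
    for m in range(M, 1, -1):
        indices.append(parent[m, indices[-1]])
    return best, indices[::-1]
```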
Theorem 7. For arbitrary δ ∈ (0, 1) and positive integer t ≤ k, algorithm A6 finds a (1 + 1/(δt))-approximate solution of Problem 2 in O(2^k (qk + N(M(Tmax − Tmin + 1) + q))) time with probability not less than 1 − (δ + α), where α = Σ_{i=0}^{t−1} C(k, i) (M/N)^i (1 − M/N)^{k−i} and C(k, i) denotes the binomial coefficient.

Corollary 2. Assume that M ≥ βN, where β ∈ (0, 1) is a constant. Then, given ε > 0 and γ ∈ (0, 1), for the fixed parameter k = max(⌈(2/β)⌈2/(γε)⌉⌉, ⌈(8/β) ln(2/γ)⌉) algorithm A6 finds a (1 + ε)-approximate solution of Problem 2 with probability not less than 1 − γ in O(qMN²) time.

Theorem 8. Under the conditions of Theorem 7, let k = ⌈log₂ N⌉, δ = (log₂ N)^{−1/2}, t = ⌈kM/(2N)⌉. Assume that M ≥ βN, where β ∈ (0, 1) is a constant. Then algorithm A6 finds a (1 + ε_N)-approximate solution of Problem 2 with probability 1 − γ_N in O(qMN²(Tmax − Tmin + 1)) time, where ε_N ≤ (2/β)(log₂ N)^{−1/2} and γ_N ≤ (log₂ N)^{−1/2} + N^{−β/(8 ln 2)}.

Since Tmax − Tmin + 1 ≤ N, under the specified conditions the running time of the algorithm is estimated as O(qMN³).

4 Conclusion

In this paper we present a short survey of some results for two problems of 2-partitioning a finite set and a finite sequence of points in Euclidean space into clusters. We also present a new randomized algorithm for the sequence partitioning problem. For an established parameter value, the algorithm finds an approximate solution in polynomial time for fixed values of the relative error and failure probability. Conditions are found under which the algorithm is asymptotically exact and has polynomial-time complexity. Of great interest are the substantiation of faster randomized algorithms for these problems and the search for subclasses of the problems for which linear-time and sublinear-time randomized algorithms can be designed.

Acknowledgements

This work was supported by the Russian Foundation for Basic Research, projects 15-01-00462, 16-31-00186, 16-07-00168.

References

[Aloise et al., 2009] Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn., 75(2), 245–248.

[Kel'manov & Pyatkin, 2009] Kel'manov, A. V., & Pyatkin, A. V. (2009). Complexity of certain problems of searching for subsets of vectors and cluster analysis. Comput. Math. and Math. Phys., 49(11), 1966–1971.

[Gimadi et al., 2010] Gimadi, E. Kh., Pyatkin, A. V., & Rykov, I. A. (2010). On polynomial solvability of some problems of a vector subset choice in a Euclidean space of fixed dimension. J. Appl. Ind. Math., 4(1), 48–53.

[Shenmaier, 2016] Shenmaier, V. V. (2016). Solving some vector subset problems by Voronoi diagrams. J. Appl. Ind. Math., 10(4), 560–566.

[Dolgushev & Kel'manov, 2011] Dolgushev, A. V., & Kel'manov, A. V. (2011). An approximation algorithm for solving a problem of cluster analysis. J. Appl. Ind. Math., 5(4), 551–558.

[Dolgushev et al., 2015] Dolgushev, A. V., Kel'manov, A. V., & Shenmaier, V. V. (2015). Polynomial-time approximation scheme for a problem of partitioning a finite set into two clusters (in Russian). Tr. Inst. Mat. Mekh., 21(3), 100–109.

[Kel'manov & Khandeev, 2015a] Kel'manov, A. V., & Khandeev, V. I. (2015). A randomized algorithm for two-cluster partition of a set of vectors. Comput. Math. Math. Phys., 55(2), 330–339.

[Kel'manov & Khandeev, 2015b] Kel'manov, A. V., & Khandeev, V. I. (2015). An exact pseudopolynomial algorithm for a problem of the two-cluster partitioning of a set of vectors. J. Appl. Ind. Math., 9(4), 497–502.
[Kel'manov & Khandeev, 2016] Kel'manov, A. V., & Khandeev, V. I. (2016). Fully polynomial-time approximation scheme for a special case of a quadratic Euclidean 2-clustering problem. Comput. Math. Math. Phys., 56(2), 334–341.

[Kel'manov & Pyatkin, 2013] Kel'manov, A. V., & Pyatkin, A. V. (2013). On complexity of some problems of cluster analysis of vector sequences. J. Appl. Ind. Math., 7(3), 363–369.

[Kel'manov & Khamidullin, 2014] Kel'manov, A. V., & Khamidullin, S. A. (2014). An approximating polynomial algorithm for a sequence partitioning problem. J. Appl. Ind. Math., 8(2), 236–244.

[Kel'manov et al., 2017a] Kel'manov, A. V., Khamidullin, S. A., & Khandeev, V. I. (2017). Exact pseudopolynomial algorithm for one sequence partitioning problem. Automat. Remote Control, 78(1), 66–73.

[Kel'manov et al., 2016] Kel'manov, A. V., Khamidullin, S. A., & Khandeev, V. I. (2016). A fully polynomial-time approximation scheme for a sequence 2-cluster partitioning problem. J. Appl. Ind. Math., 10(2), 209–219.

[Kel'manov et al., 2017b] Kel'manov, A. V., Khamidullin, S. A., & Khandeev, V. I. (2017). A randomized algorithm for a sequence 2-clustering problem (in Russian). Comput. Math. and Math. Phys. (accepted).