=Paper=
{{Paper
|id=Vol-1623/papercpr2
|storemode=property
|title=An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets
|pdfUrl=https://ceur-ws.org/Vol-1623/papercpr2.pdf
|volume=Vol-1623
|authors=Alexandr Galashov, Alexander Kelmanov
|dblpUrl=https://dblp.org/rec/conf/door/GalashovK16
}}
==An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets==
An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets Alexandr Galashov1 and Alexander Kel’manov1,2 1 Novosibirsk State University, 2 Pirogova St., 630090 Novosibirsk, Russia 2 Sobolev Institute of Mathematics, 4 Koptyug Ave., 630090 Novosibirsk, Russia galashov.alexandr@gmail.com, kelm@math.nsc.ru Abstract. We consider a strongly NP-hard problem of finding a family of dis- joint subsets with given cardinalities in a finite set of points from Euclidean space. The minimum of the sum over all subsets from required family of the sum of the squared distances from the elements of these subsets to their cen- ters is used as a search criterion. The subsets centers are optimizable variables defined as the mean values of the elements of these subsets. With an additional restriction on the problem that the coordinates of the input points are integer, an exact algorithm is proposed. This algorithm is pseudopolynomial in the case of fixed space dimension and of fixed number of required subsets. Keywords: Euclidean space, subsets search, clustering, NP-hard problem, ex- act pseudopolynomial-time algorithm. Introduction The subject of our study is a strongly NP-hard problem of finding a family of disjoint subsets in a finite set of points from Euclidean space. Our aim is justification of an exact pseudopolynomial algorithm for a special case of this problem. The investigation is motivated by poor study of the problem and its importance in applications, in particular, for mathematical problems of data analysis, approximation theory and mathematical statistics. The problem models a situation where it’s required to classify noisy data provided from experimental observations for the states of some material objects (see [1–4] and works cited there). The paper is organized as follows. In the next section the formal definition of the problem under study is given; an example of application (origin) of the problem being investigated is also presented. In Section 3, we provide a review of the known results and announce the obtained algorithmic result. Basic definitions and statements, that give elements to prove the properties of the proposed algorithm, are presented in Section Copyright ⃝ c by the paper’s authors. Copying permitted for private and academic purposes. In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org 502 A. Galashov, A. Kel’manov 4. Finally, in Section 5 we construct an exact algorithm for solving the problem under study, justify its properties and show its pseudopolynomiality for a special case of the problem. 1 Problem formulation and its origin Everywhere below R denotes the set of real numbers, ∥ · ∥ denotes the Euclidean norm, and Z denotes the set of integer numbers. We consider the following problem. Problem 1. Given a set Y = {y1 , . . . , yN } of points from Rq and some positive integers M1 , . . . , MJ . Find a family {C1 , . . . , CJ } of disjoint subsets of Y such that ∑ J ∑ F (C1 , . . . , CJ ) = ∥y − y(Cj )∥2 → min , (1) j=1 y∈Cj ∑ where y(Cj ) = |C1j | y is the centroid (geometrical center) of the subset Cj , under y∈Cj constraints |Cj | = Mj , j = 1, . . . , J, and ∑ J Mj ≤ N , (2) j=1 on the cardinalities of the required subsets. The Problem 1 has the following origin [5]. Given a table Y = {y1 , . . . , yN } con- taining results of multiple measurements of the q-dimenstional tuple y of numerical characteristics corresponding to J unique objects. Numbers Mj , j = 1, . . . , J, of mea- ∑J surements for the j-th object are known. Moreover, the table includes (N − j=1 Mj ) results of single measurements of the states of some random objects. Each result of measurement, presented in the table, has an error. Furthermore, the correspondence between the measurements and the objects is unknown. It’s required to find a family {C1 , . . . , CJ } of disjoint subsets of the set Y, using the minimum of squared distances criterion, where each subset contains the elements corresponding to the appropriate unique object, and estimate values y(C1 ), . . . , y(CJ ) corresponding to the numerical characteristics of the unique objects (assuming that the data has a measurement er- ror). The Problem 1 regularly appears as a mathematical problem, in particular, for Data Analysis and Pattern Recognition, and is induced [5] by the following approximation model. Given a set Y = {y1 , . . . , yN } of points from Rq and some positive integers M1 , . . . , MJ . Find a family {M1 , . . . , MJ } of disjoint subsets of the set N = {1, . . . , N } such that ∑ ∥yn − xn ∥2 → min , (3) n∈N An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding 503 where w1 ∈ Rq , n ∈ M1 , ... xn = (4) w J ∈ R , n ∈ MJ , q vn ∈ Rq , n ∈ N \(∪Jj=1 Mj ), under the same constraints on the required subsets, as in Problem 1. This problem consists in approximation of the sequence yn by the sequence xn using the minimum of squared distances criterion under the condition that the structure of the sequence xn is described by the formula (4). In this formula, the points w1 , . . . , wJ are interpreted as the numerical characteristics describing the unique objects. The points from the set {vi , i ∈ N \(∪Jj=1 Mj )} are considered as the collection of characteristics describing the random objects. Having uncovered the sum (3) using (4) and grouped the terms, we can check using derivation, that for each fixed collection {M1 , . . . , MJ } of sets, the values wj = y(Mj ), j = 1, . . . , J, provide the points where the objective function (3) attains its minimum. If we put these values in the formulae (3), (4) and let Cj = {yi ∈ Y | i ∈ Mj }, then we can simply verify (see also [5]) that the model (3), (4) induces Problem 1. A similar to Problem 1 is well-known strongly NP-hard [6] problem MSSC (Min- imum Sum-of-Squares Clustering) of clustering, which is also known as k-means [7], [8], [9]. In this problem the objective function is the same as in Problem 1, but the cardinalities of the required clusters are optimizable variables and instead of searching a family of subsets which union could not cover all input set, we have to find a partition of this set. Notice, that the numbers from the set N \(∪Jj=1 Mj ) in the model (3), (4) correspond to the elements from the set Y\(∪Jj=1 Cj ) in Problem 1. Elements of these sets do not appear in the formulae for the estimation of the values wj , j = 1, . . . , J, and the corresponding centroids y(Cj ), j = 1, . . . , J, in Problem 1. Therefore, Problem 1 could be considered as a problem of Data Censoring [10]. In this applied problem, a part of the data (generally, unknown part) of obtained experimental results in a situation of random failure of measurement instument has to be excluded from the estimation procedure, because this data distorts final estimations. 2 Known and obtained results The strong NP-hardness of Problem 1 is implied from the results obtained in [11], since in the cited work it was proved that the special case of Problem 1 when J = 1 is NP-hard in the strong sense problem. The Problem 1 is referred to the algorithmically poorly-studied problems of the dis- crete optimization. In [5] for this problem, was proposed a 2-approximation algorithm which time complexity is equal to O(N 2 (N J+1 + q)). For the case of Problem 1 when the number J of required subsets is fixed (not the input of the problem), this algorithm is polynomial. Currently, there are no other algorithmic results for Problem 1 and the known results [12–15] were obtained only for its special case when J = 1. These results are described below. 504 A. Galashov, A. Kel’manov For Problem 1 in [12], a 2-approximation polynomial-time algorithm of complexity O(qN 2 ) was constructed. For the variation of Problem 1 with an additional restriction that the coordinates of the input points are integer and for the case of fixed space dimension, in [13] an exact pseudopolynomial algorithm was presented, which time complexity is equal to O(N (M B)q ), where B is the maximum absolute value of the coordinates of the input points. Furthermore, for the case of the fixed space dimension in [14] an FPTAS was pro- posed. This scheme for a given relative error ε finds (1 + ε)-approximate solution in O(N 2 (M/ε)q ) time, that is polynomial in the size of input and 1/ε. A PTAS of complexity O(qN 2/ε+1 (9/ε)3/ε ), where ε is a guaranteed relative error, was found in [15]. In the current work, for the variation of Problem 1 with an additional restriction that the coordinates of the input points are integer, an algorithm which finds an exact solution in O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ) time is constructed, where B is the maximum absolute value of the coordinates of the input points and M is the least common multiple for the numbers M1 , . . . , MJ . In the case of the fixed dimension q of the space and of the fixed number J of required subsets, the proposed algorithm is pseudopolynomial and its time complexity is bounded by O(N 3 (M B)qJ ). 3 Fundamentals of algorithm In order to justify the properties of the algorithm, we need a few basic statements, an auxiliary problem and an exact polynomial-time algorithm finding its solution. The following statements are the geometrical basis of the algorithm. Lemma 1. Let all conditions of Problem 1 be satisfied and Y ⊂ Zq . Moreover, let B = max max |(y)i | (5) y∈Y i∈{1,...,q} be the maximum absolute value of the coordinates of the input points, where (y)i is the i-th coordinate of the point y. Then, y(Cj ) ∈ D, j = 1, . . . , J, where 1 D = {z ∈ Rq | (z)k = (v)k , (v)k ∈ Zq , |(v)k | ≤ M B, k = 1, . . . , q} , (6) M where M is the least common multiple for the numbers M1 , . . . , MJ . Proof. According to the definition of the geometrical center (as the mean over the set), we have 1 ∑ k p (y(Cj ))k = (y) = , j = 1, . . . , J, k = 1, . . . , q , |Cj | Mj y∈Cj where p is an integer such that ∑ |p| = | (y)k | ≤ Mj B ≤ M B . y∈Cj An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding 505 By the definition of the least common multiple M , for each j = 1, . . . , J, there exists an integer s(j) such that 0 < s(j) ≤ M , and M = s(j)Mj . Therefore, p s(j)p s(j)p = = , j = 1, . . . , J . Mj s(j)Mj M To conclude, notice that s(j)p ∈ Z and |s(j)p| ≤ Bs(j) ≤ BM . M ⊔ ⊓ The set D is the multi-dimensional grid with the rational step and with a center in the origin. For the cardinality of this grid the following obvious equality holds |D| = (2M B + 1)q . (7) Lemma ∑2. For each non-empty finite set Z of ∑ points from R , the minimum over x q of the ∥z − x∥ is attained at x = z = |Z| z∈Z z. 2 1 z∈Z It could be easily checked by the derivation. The calculation basis of the algorithm is given by an exact polynomial-time algo- rithm finding the solution of the following auxiliary problem. Problem 2. Given a set Y = {y1 , . . . , yN }, a tuple b = (b1 , . . . , bJ ) of points from Rq and some positive integers M1 , . . . , MJ . Find a family {B1 , . . . , BJ } of disjoint subsets of the set Y such that J ∑ ∑ Gb (B1 , . . . , BJ ) = ∥y − bj ∥2 → min , (8) j=1 y∈Bj under constraints |Bj | = Mj , j = 1, . . . , J, and (2) on the cardinalities of the required subsets. The Problem 2 can be reduced [5] to the known assignment problem (see for example [16]) and its exact solution can be found [5] in O(N (N 2 + qJ)) time. In the following, the algorithm solving Problem 2 is noted by A1 . The following technical lemma lies in the fundamentals of the alorithm and is given for the completeness of the description. Lemma ∑J 3. The least common multiple for the numbers M1 , . . . , MJ , can be found in O(J j=1 log2 Mj ) time. Proof. Let M = lcm(M1 , . . . , MJ ) be the least common multiple for the numbers M1 , . . . , MJ . Using well known properties of the least common multiple, we have the following equalities. Mi Mj lcm(Mi , Mj ) = , (9) gcd(Mi , Mj ) 506 A. Galashov, A. Kel’manov where gcd(Mi , Mj ) is the greatest common divisor for the numbers Mi , Mj . lcm(M1 , . . . , MJ ) = lcm(lcm(M1 , . . . , MJ−1 ), MJ ) . (10) It’s known that the greatest common divisor for the numbers a, b can be found by the Euclidean algorithm in O(log 2 (max(a, b))) time. From (9) and (10), the obvious recurrent algorithm of computing the least common multiple for the numbers M1 , . . . , MJ , is implied. For the complexity of computing the least common multiple for the numbers M1 , . . . , MJ , we have ∑ J−1 ∑ J ∏ J−i+1 (log(max(MJ−1+1 , lcm(M1 , . . . , MJ−i )))2 ≤ (log2 ( Mk )) i=1 i=1 k=1 ∑ J = O(J( log2 Mk )) , k=1 where we used the following obvious upper bound on the least common multiple ∏ J−i lcm(M1 , . . . , MJ−i ) ≤ Mk . k=1 ⊔ ⊓ 4 Exact pseudopolynomial algorithm The idea of the proposed algorithm is as follows. In the region of the input space, defined by the maximum absolute value of the coordinates of the input points, a multi- dimensional grid is constructed which is uniform over each coordinate and has the rational step. By construction, all centroids including optimals belong to the grid. For each tuple of J points from the constructed grid, the auxiliary Problem 2 is solved. Its solution — a family of disjoint subsets — is included in the collection of the candidates on the solution for Problem 1. The family of the subsets with minimum value of the objective function of Problem 2 is taken as a solution of Problem 1. Let us formulate an algorithm which implements the described approach. A l g o r i t h m A. Input: Set Y and positive integers M1 , . . . , MJ . Step 1. Find the least common multiple M for the numbers M1 , . . . , MJ using (9) and (10), and the value B using formula (5). Construct the grid D using formula (6). Step 2. For every tuple d = (d1 , . . . , dJ ) ∈ DJ , using Algorithm A1 , find and memorize the exact solution {B1 (d), . . . , BJ (d)} of the auxiliary Problem 2 and the value of the objective function Gd , putting b = d in the formula (8). Step 3. In the collection {B1 (d), . . . , BJ (d)}, d ∈ DJ , of the solutions found on step 2, find the tuple dA = (dA A d 1 , . . . , dJ ) on which the objective function G attains its minimum. As the solution {C1A , . . . , CJA } of Problem 1, we take constructed on step 2 subsets B1 (dA ), . . . , BJ (dA ) corresponding to the tuple dA . An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding 507 Output: Collection {C1A , . . . , CJA }. Next statement specifies the properties of the proposed algorithm. Theorem 1. Let in the conditions of Problem 1, the coordinates of all points from the set Y have integer values in the interval [−B, B]. Then, algorithm A finds an optimal solution of this problem in O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ) time. Proof. Let the subsets C1∗ , . . . , CJ∗ be the optimal solution of Problem 1. According to the definition of step 3, algorithm A finds the solution of Problem 1 in the form: {C1A , . . . , CJA } = {B1 (dA ), . . . , BJ (dA )} , (11) where dA = arg min Gd (B1 (d), . . . , BJ (d)) . (12) d∈D J Thus, noticing that the elements of the optimal tuple y ∗ = (y(C1∗ ), . . . , y(Cj∗ )) be- long to the set D, according to Lemma 1, from the definitions (8), (12) and (1), the following is implied ∑J ∑ ∥y − y(Cj∗ )∥2 = F (C1∗ , . . . , CJ∗ ) . A Gd ≤ (13) j=1 y∈Cj∗ Then, using Lemma 2, definitions (1), (8), (11), we obtain ∑J ∑ F (C1A , . . . , CJA ) = ∥y − y(CjA )∥2 j=1 y∈CjA ∑J ∑ ∑J ∑ A ≤ ∥y − dA j ∥ = 2 ∥y − dA j ∥ =G 2 d . (14) j=1 j=1 y∈CjA y∈Bj (dA ) Combining (13) and (14), we find the estimation F (C1A , . . . , CJA ) ≤ F (C1∗ , . . . , CJ∗ ) . On the other hand, the sets C1A , . . . , CJA are the feasible solution of Problem 1. Thus, we have the estimation F (C1∗ , . . . , CJ∗ ) ≤ F (C1A , . . . , CJA ) . From the obtained estimations, we have the equiality F (C1A , . . . , CJA ) = F (C1∗ , . . . , CJ∗ ) . Let us estimate the time complexity of the algorithm using its stepwise notation. By Lemma 3, computational costs of computing ∑J the least common multiple for the numbers M1 , . . . , MJ on step 1 are O(J k=1 log2 Mk ), which is not greater than O(J 2 log2 N ). On this step, computational costs of calculation of the values B and D are bounded by O(qN ) and O(1) respectively. 508 A. Galashov, A. Kel’manov The time complexity required to find an exact solution for the auxiliary Problem 2 (see Section 4) is equal to O(N (N 2 + qJ)). On step 2, this problem is solved O(|D|J ) times. Therefore, according to (7), the complexity of step 2 is equal to O(N (N 2 + qJ)(2M B + 1)qJ ). Summing up all the costs on all the steps, we find that the time complexity of the algorithm is bounded by O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ). ⊔ ⊓ Remark 1. If the dimension q of the space and the number J of desired subsets are fixed, the coordinates of the input points from Y are integer and belong to an interval [−B, B], then algorithm A is an exact pseudopolynomial algorithm solving Problem 1 in O(N 3 (M B)qJ ) time. 5 Conclusion In the paper, we have considered the strongly NP-hard problem of finding a family of disjoint subsets in a finite set of points from Euclidean space. For the special case of the problem when the input points coordinates are integer, we have constructed an exact algorithm. This algorithm is a pseudopolynomial one when the input space dimension and the number of required subsets are fixed. Of considerable interest is justification of faster approximation polynomial-time algorithms with guaranteed accuracy for general problem. Acknowledgments. This work was supported by the Russian Foundation for Basic Research (project nos. 15-01-00462, and 16-07-00168) and by the grant of Presidium of RAS (program 5, project 227). References 1. Hastie T., Tibshirani R., Friedman J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York (2009) 2. James G., Witten D., Hastie T., Tibshirani R.: An Introduction to Statistical Learning. Springer Science+Business Media, LLC, New York (2013) 3. Bishop C.M.: Pattern Recognition and Machine Learning. Springer Science+Business Me- dia, LLC, New York (2006) 4. Flach P.: Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, New York (2012) 5. Galashov A.E., Kelmanov A.V.: A 2-Approximate Algorithm to Solve One Problem of the Family of Disjoint Vector Subsets. J. Automation and Remote Control. 75(4), 595–606 (2014) 6. Aloise D., Deshpande A., Hansen P., Popat P.: NP-hardness of Euclidean sum-of-squares clustering. J. Machine Learning. 75(2), 245–248 (2009) 7. Jain A.K.: Data Clustering: 50 Years Beyond k-Means. J. Pattern Recognition Lett. 31, 651–666 (2010) 8. Edwards A.W.F., Cavalli-Sforza L.L.: A Method for Cluster Analysis. J. Biometrics. 21, 362–375, (1965) An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding 509 9. MacQueen J.B.: Some Methods for Classification and Analysis of Multivariate Observa- tions. In: Fith Berkley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281 – 297. University of California Press, Berkley, California (1967) 10. Bagdonavicius, V.,Kruopis, J., Nikulin, M.S.: Non-parametric Tests for Censored Data. ISTE/WILEY, London, (2011) 11. Kel’manov A.V., Pyatkin A.V.: NP-Completeness of Some Problems of Choosing a Vector Subset. J. Applied and Industrial Mathematics. 5(3), 352–357, (2011) 12. Kel’manov A.V., Romanchenko S.M.: An Approximation Algorithm for Solving a Problem of Search for a Vector Subset. J. Applied and Industrial Mathematics. 6(1), 90–96, (2012) 13. Kel’manov A.V., Romanchenko S.M.: Pseudopolynomial Algorithms for Certain Compu- tationally Hard Vector Subset and Cluster Analysis Problems. J. Automation and Remote Control. 73(2), 349–354, (2012) 14. Kel’manov A.V., Romanchenko S.M.: An FPTAS for a Vector Subset Search Problem. J. Applied and Industrial Mathematics. 8(3), 329–336, (2014) 15. Schenmaier V.V.: An approximation scheme for a problem of search for a vector subset. J. Applied and Industrial Mathematics. 6(3), 381–386, (2012) 16. Papadimitriou C. H., Steiglitz K.: Combinatorial Optimization: Algorithms and Complex- ity. Prentice-Hall, Englewood Cliffs, New Jersey (1982)