<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandr Galashov</string-name>
          <email>galashov.alexandr@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Kel'manov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Novosibirsk State University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sobolev Institute of Mathematics</institution>
          ,
          <addr-line>4 Koptyug Ave., 630090 Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>501</fpage>
      <lpage>509</lpage>
      <abstract>
<p>We consider the strongly NP-hard problem of finding a family of disjoint subsets with given cardinalities in a finite set of points from Euclidean space. The minimum of the sum, over all subsets of the required family, of the sums of the squared distances from the elements of these subsets to their centers is used as the search criterion. The centers of the subsets are optimizable variables defined as the mean values of the elements of these subsets. Under the additional restriction that the coordinates of the input points are integer, an exact algorithm is proposed. This algorithm is pseudopolynomial in the case of a fixed space dimension and a fixed number of required subsets.</p>
      </abstract>
      <kwd-group>
        <kwd>Euclidean space</kwd>
        <kwd>subsets search</kwd>
        <kwd>clustering</kwd>
        <kwd>NP-hard problem</kwd>
        <kwd>exact pseudopolynomial-time algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The subject of our study is a strongly NP-hard problem of finding a family of disjoint
subsets in a finite set of points from Euclidean space. Our aim is the justification of an
exact pseudopolynomial algorithm for a special case of this problem.</p>
<p>The investigation is motivated by the fact that the problem is poorly studied and by its importance in
applications, in particular, for mathematical problems of data analysis, approximation
theory and mathematical statistics. The problem models a situation where it is required
to classify noisy data obtained from experimental observations of the states of some
material objects (see [1-4] and the works cited there).</p>
<p>The paper is organized as follows. In the next section the formal definition of the
problem under study is given; an example of an application (the origin) of the problem being
investigated is also presented. In Section 3, we provide a review of the known results and
announce the obtained algorithmic result. Basic definitions and statements that give
elements to prove the properties of the proposed algorithm are presented in Section 4.
Finally, in Section 5 we construct an exact algorithm for solving the problem under
study, justify its properties and show its pseudopolynomiality for a special case of the
problem.</p>
      <p>Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>Problem formulation and its origin</title>
<p>Everywhere below, $\mathbb{R}$ denotes the set of real numbers, $\|\cdot\|$ denotes the Euclidean norm,
and $\mathbb{Z}$ denotes the set of integers.</p>
      <p>We consider the following problem.</p>
<p>Problem 1. Given a set $\mathcal{Y}=\{y_1,\dots,y_N\}$ of points from $\mathbb{R}^q$ and some positive integers
$M_1,\dots,M_J$. Find a family $\{\mathcal{C}_1,\dots,\mathcal{C}_J\}$ of disjoint subsets of $\mathcal{Y}$ such that
$$F(\mathcal{C}_1,\dots,\mathcal{C}_J)=\sum_{j=1}^{J}\sum_{y\in\mathcal{C}_j}\|y-\overline{y}(\mathcal{C}_j)\|^2\to\min, \tag{1}$$
where $\overline{y}(\mathcal{C}_j)=\frac{1}{|\mathcal{C}_j|}\sum_{y\in\mathcal{C}_j}y$ is the centroid (geometric center) of the subset $\mathcal{C}_j$, under
the constraints $|\mathcal{C}_j|=M_j$, $j=1,\dots,J$, and
$$\sum_{j=1}^{J}M_j\le N \tag{2}$$
on the cardinalities of the required subsets.</p>
<p>Problem 1 has the following origin [5]. Given a table $\mathcal{Y}=\{y_1,\dots,y_N\}$
containing the results of multiple measurements of the q-dimensional tuple $y$ of numerical
characteristics corresponding to $J$ unique objects. The numbers $M_j$, $j=1,\dots,J$, of
measurements for the $j$-th object are known. Moreover, the table includes $(N-\sum_{j=1}^{J}M_j)$
results of single measurements of the states of some random objects. Each measurement
result presented in the table has an error. Furthermore, the correspondence
between the measurements and the objects is unknown. It is required to find a family
$\{\mathcal{C}_1,\dots,\mathcal{C}_J\}$ of disjoint subsets of the set $\mathcal{Y}$, using the minimum of squared distances
criterion, where each subset contains the elements corresponding to the appropriate
unique object, and to estimate the values $\overline{y}(\mathcal{C}_1),\dots,\overline{y}(\mathcal{C}_J)$ corresponding to the numerical
characteristics of the unique objects (assuming that the data has a measurement
error).</p>
<p>Problem 1 regularly appears as a mathematical problem, in particular, in Data
Analysis and Pattern Recognition, and is induced [5] by the following approximation
model.</p>
<p>
Given a set $\mathcal{Y}=\{y_1,\dots,y_N\}$ of points from $\mathbb{R}^q$ and some positive integers $M_1,\dots,M_J$.
Find a family $\{\mathcal{M}_1,\dots,\mathcal{M}_J\}$ of disjoint subsets of the set $\mathcal{N}=\{1,\dots,N\}$ such that
$$\sum_{n\in\mathcal{N}}\|y_n-x_n\|^2\to\min, \tag{3}$$
where
$$x_n=\begin{cases} w_1\in\mathbb{R}^q, &amp; n\in\mathcal{M}_1,\\ \quad\vdots &amp; \\ w_J\in\mathbb{R}^q, &amp; n\in\mathcal{M}_J,\\ v_n\in\mathbb{R}^q, &amp; n\in\mathcal{N}\setminus\big(\bigcup_{j=1}^{J}\mathcal{M}_j\big), \end{cases} \tag{4}$$
under the same constraints on the required subsets as in Problem 1.
</p>
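<p>To make the model concrete, here is a small one-dimensional numeric illustration (the values are assumed for illustration, not taken from the paper): the sequence $x_n$ is built according to (4) and the criterion (3) is evaluated.</p>

```python
import numpy as np

# Illustrative instance of the model (3)-(4): N = 6 one-dimensional
# measurements, J = 2 unique objects with characteristics w_1, w_2.
y = np.array([1.1, 4.9, 0.9, 5.2, 9.0, 1.0])
M1, M2 = {0, 2, 5}, {1, 3}          # index sets M_1, M_2 (0-based)
w = np.array([1.0, 5.0])            # w_1, w_2
v = {4: 9.0}                        # v_n for n outside the union

# Build x_n according to (4).
x = np.empty_like(y)
for n in range(len(y)):
    if n in M1:
        x[n] = w[0]
    elif n in M2:
        x[n] = w[1]
    else:
        x[n] = v[n]

objective = ((y - x) ** 2).sum()    # criterion (3)
```

With these assumed values the criterion sums the squared measurement errors of the assigned points only.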
<p>
This problem consists in the approximation of the sequence $y_n$ by the sequence $x_n$ using
the minimum of squared distances criterion, under the condition that the structure of the
sequence $x_n$ is described by formula (4). In this formula, the points $w_1,\dots,w_J$ are
interpreted as the numerical characteristics describing the unique objects. The points
from the set $\{v_i,\ i\in\mathcal{N}\setminus(\bigcup_{j=1}^{J}\mathcal{M}_j)\}$ are considered as the collection of characteristics
describing the random objects.
</p>
<p>
Expanding the sum (3) using (4) and grouping the terms, one can check
by differentiation that, for each fixed collection $\{\mathcal{M}_1,\dots,\mathcal{M}_J\}$ of sets, the values $w_j=\overline{y}(\mathcal{M}_j)$, $j=1,\dots,J$, provide the points where the objective function (3) attains its
minimum. If we substitute these values into the formulae (3), (4) and let $\mathcal{C}_j=\{y_i\in\mathcal{Y}\mid i\in\mathcal{M}_j\}$,
then we can easily verify (see also [5]) that the model (3), (4) induces Problem 1.
</p>
<p>A problem similar to Problem 1 is the well-known strongly NP-hard [6] clustering problem MSSC
(Minimum Sum-of-Squares Clustering), also known as k-means [7],
[8], [9]. In that problem the objective function is the same as in Problem 1, but the
cardinalities of the required clusters are optimizable variables and, instead of searching for
a family of subsets whose union need not cover the whole input set, we have to find a partition
of this set.</p>
<p>
Notice that the numbers from the set $\mathcal{N}\setminus(\bigcup_{j=1}^{J}\mathcal{M}_j)$ in the model (3), (4) correspond
to the elements of the set $\mathcal{Y}\setminus(\bigcup_{j=1}^{J}\mathcal{C}_j)$ in Problem 1. Elements of these sets do not
appear in the formulae for the estimation of the values $w_j$, $j=1,\dots,J$, and the
corresponding centroids $\overline{y}(\mathcal{C}_j)$, $j=1,\dots,J$, in Problem 1. Therefore, Problem 1 can
be considered as a problem of Data Censoring [10]. In this applied problem, a part
of the data (generally, an unknown part) of the obtained experimental results in a situation
of random failure of a measurement instrument has to be excluded from the estimation
procedure, because this data distorts the final estimates.
</p>
    </sec>
    <sec id="sec-3">
      <title>Known and obtained results</title>
<p>The strong NP-hardness of Problem 1 follows from the results obtained in [11],
since in the cited work it was proved that the special case of Problem 1 with $J=1$ is
NP-hard in the strong sense.</p>
<p>Problem 1 belongs to the algorithmically poorly studied problems of
discrete optimization. In [5], a 2-approximation algorithm was proposed for this problem
whose time complexity is $O(N^2(N^{J+1}+q))$. For the case of Problem 1 when
the number $J$ of required subsets is fixed (i.e., is not part of the input of the problem), this algorithm
is polynomial. Currently, there are no other algorithmic results for Problem 1, and the
known results [12-15] were obtained only for its special case when $J=1$. These results
are described below.</p>
<p>For this special case of Problem 1, a 2-approximation polynomial-time algorithm of complexity
$O(qN^2)$ was constructed in [12].</p>
<p>For the variation of Problem 1 with the additional restriction that the coordinates
of the input points are integer, and for the case of a fixed space dimension, an
exact pseudopolynomial algorithm was presented in [13]; its time complexity is
$O(N(MB)^q)$, where $B$ is the maximum absolute value of the coordinates of the input
points.</p>
<p>Furthermore, for the case of a fixed space dimension, an FPTAS was
proposed in [14]. For a given relative error $\varepsilon$, this scheme finds a $(1+\varepsilon)$-approximate solution in
$O(N^2(M/\varepsilon)^q)$ time, which is polynomial in the size of the input and in $1/\varepsilon$.</p>
<p>A PTAS of complexity $O(qN^{2/\varepsilon+1}(9/\varepsilon)^{3/\varepsilon})$, where $\varepsilon$ is a guaranteed relative error,
was found in [15].</p>
<p>In the current work, for the variation of Problem 1 with the additional restriction
that the coordinates of the input points are integer, an algorithm is constructed which finds an exact
solution in $O(N(N^2+qJ)(2MB+1)^{qJ}+J^2\log_2 N)$ time, where $B$ is
the maximum absolute value of the coordinates of the input points and $M$ is the least
common multiple of the numbers $M_1,\dots,M_J$. In the case of a fixed dimension $q$
of the space and a fixed number $J$ of required subsets, the proposed algorithm is
pseudopolynomial and its time complexity is bounded by $O(N^3(MB)^{qJ})$.</p>
    </sec>
    <sec id="sec-4">
      <title>Fundamentals of algorithm</title>
<p>In order to justify the properties of the algorithm, we need a few basic statements, an
auxiliary problem, and an exact polynomial-time algorithm finding its solution.</p>
      <p>The following statements form the geometrical basis of the algorithm.</p>
<p>Lemma 1. Let all the conditions of Problem 1 be satisfied and $\mathcal{Y}\subset\mathbb{Z}^q$. Moreover, let
$$B=\max_{y\in\mathcal{Y}}\max_{i\in\{1,\dots,q\}}|(y)_i| \tag{5}$$
be the maximum absolute value of the coordinates of the input points, where $(y)_i$ is the
$i$-th coordinate of the point $y$. Then $\overline{y}(\mathcal{C}_j)\in\mathcal{D}$, $j=1,\dots,J$, where
$$\mathcal{D}=\Big\{z\in\mathbb{R}^q\ \Big|\ (z)_k=\frac{1}{M}(v)_k,\ (v)_k\in\mathbb{Z},\ |(v)_k|\le MB,\ k=1,\dots,q\Big\}, \tag{6}$$
and $M$ is the least common multiple of the numbers $M_1,\dots,M_J$.</p>
      <p>Proof. According to the definition of the centroid (as the mean over the subset), we have
$$(\overline{y}(\mathcal{C}_j))_k=\frac{1}{|\mathcal{C}_j|}\sum_{y\in\mathcal{C}_j}(y)_k=\frac{p}{M_j},\quad j=1,\dots,J,\ k=1,\dots,q,$$
where $p$ is an integer such that
$$|p|=\Big|\sum_{y\in\mathcal{C}_j}(y)_k\Big|\le M_jB\le MB.$$
By the definition of the least common multiple $M$, for each $j=1,\dots,J$ there
exists an integer $s(j)$ such that $0&lt;s(j)\le M$ and $M=s(j)M_j$. Therefore,
$$\frac{p}{M_j}=\frac{s(j)p}{s(j)M_j}=\frac{s(j)p}{M},\quad j=1,\dots,J.$$
To conclude, notice that $s(j)p\in\mathbb{Z}$ and $|s(j)p|\le s(j)M_jB=MB$. ⊓⊔</p>
      <p>The set $\mathcal{D}$ is a multidimensional grid with a rational step, centered at the origin. For the cardinality of this grid the following obvious equality holds:
$$|\mathcal{D}|=(2MB+1)^q. \tag{7}$$</p>
      <p>Lemma 2. For each non-empty finite set $\mathcal{Z}$ of points from $\mathbb{R}^q$, the minimum over $x$
of $\sum_{z\in\mathcal{Z}}\|z-x\|^2$ is attained at $x=\overline{z}=\frac{1}{|\mathcal{Z}|}\sum_{z\in\mathcal{Z}}z$. This is easily checked by differentiation.</p>
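<p>Lemma 2 is also easy to confirm numerically; a minimal sketch (illustrative code using numpy, with assumed data):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.integers(-5, 6, size=(8, 3)).astype(float)   # a finite set in R^3
centroid = Z.mean(axis=0)                            # the candidate minimizer

def ssd(x):
    # Sum over z in Z of ||z - x||^2, the objective of Lemma 2.
    return ((Z - x) ** 2).sum()

# The centroid never does worse than randomly chosen candidate points x.
for _ in range(100):
    x = rng.normal(size=3) * 10
    assert ssd(centroid) <= ssd(x) + 1e-9
```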
<p>The calculation basis of the algorithm is given by an exact polynomial-time
algorithm finding the solution of the following auxiliary problem.</p>
      <p>Problem 2. Given a set $\mathcal{Y}=\{y_1,\dots,y_N\}$, a tuple $b=(b_1,\dots,b_J)$ of points from $\mathbb{R}^q$,
and some positive integers $M_1,\dots,M_J$. Find a family $\{\mathcal{B}_1,\dots,\mathcal{B}_J\}$ of disjoint subsets
of the set $\mathcal{Y}$ such that
$$G_b(\mathcal{B}_1,\dots,\mathcal{B}_J)=\sum_{j=1}^{J}\sum_{y\in\mathcal{B}_j}\|y-b_j\|^2\to\min \tag{8}$$
under the constraints $|\mathcal{B}_j|=M_j$, $j=1,\dots,J$, and (2) on the cardinalities of the required
subsets.</p>
      <p>Problem 2 can be reduced [5] to the well-known assignment problem (see, for example,
[16]) and its exact solution can be found [5] in $O(N(N^2+qJ))$ time. In the following,
the algorithm solving Problem 2 is denoted by A1.</p>
<p>The following technical lemma underlies the algorithm and is given
for completeness of the description.</p>
      <p>Lemma 3. The least common multiple of the numbers $M_1,\dots,M_J$ can be found in
$O(J\sum_{j=1}^{J}\log_2 M_j)$ time.</p>
<p>Proof. Let $M=\mathrm{lcm}(M_1,\dots,M_J)$ be the least common multiple of the numbers
$M_1,\dots,M_J$. Using well-known properties of the least common multiple, we have the
following equalities:
$$\mathrm{lcm}(M_i,M_j)=\frac{M_iM_j}{\gcd(M_i,M_j)}, \tag{9}$$
where $\gcd(M_i,M_j)$ is the greatest common divisor of the numbers $M_i,M_j$, and
$$\mathrm{lcm}(M_1,\dots,M_J)=\mathrm{lcm}(\mathrm{lcm}(M_1,\dots,M_{J-1}),M_J). \tag{10}$$</p>
      <p>It is known that the greatest common divisor of the numbers $a,b$ can be found by
the Euclidean algorithm in $O(\log_2(\max(a,b)))$ time.</p>
      <p>
From (9) and (10), an obvious recurrent algorithm for computing the least common
multiple of the numbers $M_1,\dots,M_J$ follows. For the complexity of computing the
least common multiple we have
$$\sum_{i=1}^{J-1}\log_2\big(\max(M_{J-i+1},\ \mathrm{lcm}(M_1,\dots,M_{J-i}))\big)\le\sum_{i=1}^{J-1}\log_2\Big(\prod_{k=1}^{J-i+1}M_k\Big)=O\Big(J\sum_{k=1}^{J}\log_2 M_k\Big),$$
where we used the following obvious upper bound on the least common multiple:
$$\mathrm{lcm}(M_1,\dots,M_{J-i})\le\prod_{k=1}^{J-i}M_k.$$
⊓⊔
      </p>
    </sec>
    <sec id="sec-5">
      <title>Exact pseudopolynomial algorithm</title>
<p>The idea of the proposed algorithm is as follows. In the region of the input space
defined by the maximum absolute value of the coordinates of the input points, a
multidimensional grid is constructed which is uniform over each coordinate and has a
rational step. By construction, all centroids, including the optimal ones, belong to the grid.</p>
<p>For each tuple of $J$ points from the constructed grid, the auxiliary Problem 2 is
solved. Its solution, a family of disjoint subsets, is included in the collection of
candidates for the solution of Problem 1.</p>
      <p>The family of subsets with the minimum value of the objective function of Problem
2 is taken as the solution of Problem 1.</p>
<p>Let us formulate an algorithm which implements the described approach.</p>
      <p>A l g o r i t h m  A.</p>
      <p>Input: the set $\mathcal{Y}$ and positive integers $M_1,\dots,M_J$.</p>
      <p>Step 1. Find the least common multiple $M$ of the numbers $M_1,\dots,M_J$ using (9)
and (10), and the value $B$ using formula (5). Construct the grid $\mathcal{D}$ using formula (6).</p>
      <p>Step 2. For every tuple $d=(d_1,\dots,d_J)\in\mathcal{D}^J$, using algorithm A1, find and
memorize the exact solution $\{\mathcal{B}_1(d),\dots,\mathcal{B}_J(d)\}$ of the auxiliary Problem 2 and the
value of the objective function $G_d$, putting $b=d$ in formula (8).</p>
      <p>Step 3. In the collection $\{\mathcal{B}_1(d),\dots,\mathcal{B}_J(d)\}$, $d\in\mathcal{D}^J$, of the solutions found on
step 2, find the tuple $d_A=(d_1^A,\dots,d_J^A)$ at which the objective function $G_d$ attains its
minimum. As the solution $\{\mathcal{C}_1^A,\dots,\mathcal{C}_J^A\}$ of Problem 1, we take the subsets
$\mathcal{B}_1(d_A),\dots,\mathcal{B}_J(d_A)$ constructed on step 2 and corresponding to the tuple $d_A$.</p>
      <p>Output: the collection $\{\mathcal{C}_1^A,\dots,\mathcal{C}_J^A\}$.</p>
<p>The next statement specifies the properties of the proposed algorithm.</p>
<p>Theorem 1. Let, in the conditions of Problem 1, the coordinates of all points from the
set $\mathcal{Y}$ be integers in the interval $[-B,B]$. Then algorithm A finds an optimal
solution of this problem in $O(N(N^2+qJ)(2MB+1)^{qJ}+J^2\log_2 N)$ time.</p>
      <p>Proof. Let the subsets $\mathcal{C}_1^*,\dots,\mathcal{C}_J^*$ be an optimal solution of Problem 1. According to
the definition of step 3, algorithm A finds the solution of Problem 1 in the form
$$\{\mathcal{C}_1^A,\dots,\mathcal{C}_J^A\}=\{\mathcal{B}_1(d_A),\dots,\mathcal{B}_J(d_A)\}, \tag{11}$$
where
$$d_A=\arg\min_{d\in\mathcal{D}^J}G_d(\mathcal{B}_1(d),\dots,\mathcal{B}_J(d)). \tag{12}$$</p>
      <p>Thus, noticing that, by Lemma 1, the elements of the optimal tuple
$\overline{y}=(\overline{y}(\mathcal{C}_1^*),\dots,\overline{y}(\mathcal{C}_J^*))$ belong to the set $\mathcal{D}$, from the definitions (8), (12) and (1) it follows that
$$G_{d_A}\le\sum_{j=1}^{J}\sum_{y\in\mathcal{C}_j^*}\|y-\overline{y}(\mathcal{C}_j^*)\|^2=F(\mathcal{C}_1^*,\dots,\mathcal{C}_J^*). \tag{13}$$</p>
      <p>Then, using Lemma 2 and the definitions (1), (8), (11), we obtain
$$F(\mathcal{C}_1^A,\dots,\mathcal{C}_J^A)=\sum_{j=1}^{J}\sum_{y\in\mathcal{C}_j^A}\|y-\overline{y}(\mathcal{C}_j^A)\|^2\le\sum_{j=1}^{J}\sum_{y\in\mathcal{C}_j^A}\|y-d_j^A\|^2=\sum_{j=1}^{J}\sum_{y\in\mathcal{B}_j(d_A)}\|y-d_j^A\|^2=G_{d_A}. \tag{14}$$</p>
      <p>Combining (13) and (14), we find the estimate
$$F(\mathcal{C}_1^A,\dots,\mathcal{C}_J^A)\le F(\mathcal{C}_1^*,\dots,\mathcal{C}_J^*).$$
On the other hand, the sets $\mathcal{C}_1^A,\dots,\mathcal{C}_J^A$ are a feasible solution of Problem 1. Thus,
we have the estimate
$$F(\mathcal{C}_1^*,\dots,\mathcal{C}_J^*)\le F(\mathcal{C}_1^A,\dots,\mathcal{C}_J^A).$$
From the obtained estimates, we have the equality
$$F(\mathcal{C}_1^A,\dots,\mathcal{C}_J^A)=F(\mathcal{C}_1^*,\dots,\mathcal{C}_J^*).$$</p>
      <p>Let us estimate the time complexity of the algorithm using its stepwise notation.</p>
<p>
By Lemma 3, the computational cost of computing the least common multiple of
the numbers $M_1,\dots,M_J$ on step 1 is $O(J\sum_{k=1}^{J}\log_2 M_k)$, which is not greater than
$O(J^2\log_2 N)$. On this step, the computational costs of calculating the value $B$ and the grid $\mathcal{D}$
are bounded by $O(qN)$ and $O(q(2MB+1)^q)$, respectively.
</p>
<p>
The time complexity required to find an exact solution of the auxiliary Problem 2
(see Section 4) is $O(N(N^2+qJ))$. On step 2, this problem is solved $O(|\mathcal{D}|^J)$
times. Therefore, according to (7), the complexity of step 2 is $O(N(N^2+qJ)(2MB+1)^{qJ})$.
</p>
<p>Summing up the costs of all the steps, we find that the time complexity of the
algorithm is bounded by $O(N(N^2+qJ)(2MB+1)^{qJ}+J^2\log_2 N)$. ⊓⊔</p>
      <p>Remark 1. If the dimension $q$ of the space and the number $J$ of desired subsets are
fixed, the coordinates of the input points from $\mathcal{Y}$ are integer and belong to an interval
$[-B,B]$, then algorithm A is an exact pseudopolynomial algorithm solving Problem 1
in $O(N^3(MB)^{qJ})$ time.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>In this paper, we have considered the strongly NP-hard problem of finding a family of
disjoint subsets in a finite set of points from Euclidean space.</p>
<p>For the special case of the problem in which the coordinates of the input points are integer,
we have constructed an exact algorithm. This algorithm is pseudopolynomial
when the input space dimension and the number of required subsets are fixed.</p>
<p>Of considerable interest is the justification of faster polynomial-time approximation
algorithms with guaranteed accuracy for the general problem.</p>
<p>Acknowledgments. This work was supported by the Russian Foundation for Basic
Research (projects nos. 15-01-00462 and 16-07-00168) and by a grant of the Presidium
of RAS (program 5, project 227).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. Hastie T., Tibshirani R., Friedman J.: <source>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</source>. Springer-Verlag, New York (<year>2009</year>)</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. James G., Witten D., Hastie T., Tibshirani R.: <source>An Introduction to Statistical Learning</source>. Springer Science+Business Media, LLC, New York (<year>2013</year>)</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. Bishop C.M.: <source>Pattern Recognition and Machine Learning</source>. Springer Science+Business Media, LLC, New York (<year>2006</year>)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Flach</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <source>Machine Learning: The Art and Science of Algorithms that Make Sense of Data</source>
          . Cambridge University Press, New York (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>5. Galashov A.E., Kel'manov A.V.: <article-title>A 2-Approximate Algorithm to Solve One Problem of the Family of Disjoint Vector Subsets</article-title>. <source>J. Automation and Remote Control</source>. <volume>75</volume>(<issue>4</issue>), <fpage>595</fpage>-<lpage>606</lpage> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>6. Aloise D., Deshpande A., Hansen P., Popat P.: <article-title>NP-hardness of Euclidean sum-of-squares clustering</article-title>. <source>J. Machine Learning</source>. <volume>75</volume>(<issue>2</issue>), <fpage>245</fpage>-<lpage>248</lpage> (<year>2009</year>)</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>7. Jain A.K.: <article-title>Data Clustering: 50 Years Beyond k-Means</article-title>. <source>J. Pattern Recognition Lett</source>. <volume>31</volume>, <fpage>651</fpage>-<lpage>666</lpage> (<year>2010</year>)</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>8. Edwards A.W.F., Cavalli-Sforza L.L.: <article-title>A Method for Cluster Analysis</article-title>. <source>J. Biometrics</source>. <volume>21</volume>, <fpage>362</fpage>-<lpage>375</lpage> (<year>1965</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>9. MacQueen J.B.: <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>. <source>In: Fifth Berkeley Symposium on Mathematical Statistics and Probability</source>, vol. <volume>1</volume>, pp. <fpage>281</fpage>-<lpage>297</lpage>. University of California Press, Berkeley, California (<year>1967</year>)</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>10. Bagdonavicius V., Kruopis J., Nikulin M.S.: <source>Non-parametric Tests for Censored Data</source>. ISTE/WILEY, London (<year>2011</year>)</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>11. Kel'manov A.V., Pyatkin A.V.: <article-title>NP-Completeness of Some Problems of Choosing a Vector Subset</article-title>. <source>J. Applied and Industrial Mathematics</source>. <volume>5</volume>(<issue>3</issue>), <fpage>352</fpage>-<lpage>357</lpage> (<year>2011</year>)</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>12. Kel'manov A.V., Romanchenko S.M.: <article-title>An Approximation Algorithm for Solving a Problem of Search for a Vector Subset</article-title>. <source>J. Applied and Industrial Mathematics</source>. <volume>6</volume>(<issue>1</issue>), <fpage>90</fpage>-<lpage>96</lpage> (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>13. Kel'manov A.V., Romanchenko S.M.: <article-title>Pseudopolynomial Algorithms for Certain Computationally Hard Vector Subset and Cluster Analysis Problems</article-title>. <source>J. Automation and Remote Control</source>. <volume>73</volume>(<issue>2</issue>), <fpage>349</fpage>-<lpage>354</lpage> (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>14. Kel'manov A.V., Romanchenko S.M.: <article-title>An FPTAS for a Vector Subset Search Problem</article-title>. <source>J. Applied and Industrial Mathematics</source>. <volume>8</volume>(<issue>3</issue>), <fpage>329</fpage>-<lpage>336</lpage> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>15. Schenmaier V.V.: <article-title>An approximation scheme for a problem of search for a vector subset</article-title>. <source>J. Applied and Industrial Mathematics</source>. <volume>6</volume>(<issue>3</issue>), <fpage>381</fpage>-<lpage>386</lpage> (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Papadimitriou</surname>
            <given-names>C. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiglitz</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Combinatorial Optimization: Algorithms and Complexity</article-title>
          . Prentice-Hall, Englewood Cliffs, New Jersey (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>