=Paper= {{Paper |id=Vol-1623/papercpr2 |storemode=property |title=An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets |pdfUrl=https://ceur-ws.org/Vol-1623/papercpr2.pdf |volume=Vol-1623 |authors=Alexandr Galashov, Alexander Kelmanov |dblpUrl=https://dblp.org/rec/conf/door/GalashovK16 }} ==An Exact Pseudopolynomial Algorithm for a Problem of Finding a Family of Disjoint Subsets== https://ceur-ws.org/Vol-1623/papercpr2.pdf
          An Exact Pseudopolynomial Algorithm
       for a Problem of Finding a Family of Disjoint
                         Subsets

                      Alexandr Galashov1 and Alexander Kel’manov1,2
                                  1
                                   Novosibirsk State University,
                           2 Pirogova St., 630090 Novosibirsk, Russia
                              2
                                Sobolev Institute of Mathematics,
                          4 Koptyug Ave., 630090 Novosibirsk, Russia
                       galashov.alexandr@gmail.com, kelm@math.nsc.ru



       Abstract. We consider a strongly NP-hard problem of finding a family of dis-
       joint subsets with given cardinalities in a finite set of points from Euclidean
       space. The minimum of the sum over all subsets from required family of the
       sum of the squared distances from the elements of these subsets to their cen-
       ters is used as a search criterion. The subsets centers are optimizable variables
       defined as the mean values of the elements of these subsets. With an additional
       restriction on the problem that the coordinates of the input points are integer,
       an exact algorithm is proposed. This algorithm is pseudopolynomial in the case
       of fixed space dimension and of fixed number of required subsets.

       Keywords: Euclidean space, subsets search, clustering, NP-hard problem, ex-
       act pseudopolynomial-time algorithm.


Introduction

The subject of our study is a strongly NP-hard problem of finding a family of disjoint
subsets in a finite set of points from Euclidean space. Our aim is justification of an
exact pseudopolynomial algorithm for a special case of this problem.
    The investigation is motivated by poor study of the problem and its importance in
applications, in particular, for mathematical problems of data analysis, approximation
theory and mathematical statistics. The problem models a situation where it’s required
to classify noisy data provided from experimental observations for the states of some
material objects (see [1–4] and works cited there).
    The paper is organized as follows. In the next section the formal definition of the
problem under study is given; an example of application (origin) of the problem being
investigated is also presented. In Section 3, we provide a review of the known results and
announce the obtained algorithmic result. Basic definitions and statements, that give
elements to prove the properties of the proposed algorithm, are presented in Section

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org
502     A. Galashov, A. Kel’manov

4. Finally, in Section 5 we construct an exact algorithm for solving the problem under
study, justify its properties and show its pseudopolynomiality for a special case of the
problem.


1     Problem formulation and its origin

Everywhere below R denotes the set of real numbers, ∥ · ∥ denotes the Euclidean norm,
and Z denotes the set of integer numbers.
   We consider the following problem.

Problem 1. Given a set Y = {y1 , . . . , yN } of points from Rq and some positive integers
M1 , . . . , MJ . Find a family {C1 , . . . , CJ } of disjoint subsets of Y such that

                                                 ∑
                                                 J ∑
                        F (C1 , . . . , CJ ) =              ∥y − y(Cj )∥2 → min ,             (1)
                                                 j=1 y∈Cj

                        ∑
where y(Cj ) = |C1j |          y is the centroid (geometrical center) of the subset Cj , under
                        y∈Cj
constraints |Cj | = Mj , j = 1, . . . , J, and

                                                 ∑
                                                 J
                                                       Mj ≤ N ,                               (2)
                                                 j=1

on the cardinalities of the required subsets.

    The Problem 1 has the following origin [5]. Given a table Y = {y1 , . . . , yN } con-
taining results of multiple measurements of the q-dimenstional tuple y of numerical
characteristics corresponding to J unique objects. Numbers Mj , j = 1, . . . , J, of mea-
                                                                                    ∑J
surements for the j-th object are known. Moreover, the table includes (N − j=1 Mj )
results of single measurements of the states of some random objects. Each result of
measurement, presented in the table, has an error. Furthermore, the correspondence
between the measurements and the objects is unknown. It’s required to find a family
{C1 , . . . , CJ } of disjoint subsets of the set Y, using the minimum of squared distances
criterion, where each subset contains the elements corresponding to the appropriate
unique object, and estimate values y(C1 ), . . . , y(CJ ) corresponding to the numerical
characteristics of the unique objects (assuming that the data has a measurement er-
ror).
    The Problem 1 regularly appears as a mathematical problem, in particular, for Data
Analysis and Pattern Recognition, and is induced [5] by the following approximation
model.
    Given a set Y = {y1 , . . . , yN } of points from Rq and some positive integers M1 , . . . , MJ .
Find a family {M1 , . . . , MJ } of disjoint subsets of the set N = {1, . . . , N } such that
                                      ∑
                                           ∥yn − xn ∥2 → min ,                                 (3)
                                      n∈N
                An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding        503

where                         
                              
                              w1 ∈ Rq , n ∈ M1 ,
                              
                               ...
                         xn =                                                           (4)
                              
                              w J ∈ R , n ∈ MJ ,
                                      q
                              
                              
                               vn ∈ Rq , n ∈ N \(∪Jj=1 Mj ),
under the same constraints on the required subsets, as in Problem 1.
    This problem consists in approximation of the sequence yn by the sequence xn using
the minimum of squared distances criterion under the condition that the structure of the
sequence xn is described by the formula (4). In this formula, the points w1 , . . . , wJ are
interpreted as the numerical characteristics describing the unique objects. The points
from the set {vi , i ∈ N \(∪Jj=1 Mj )} are considered as the collection of characteristics
describing the random objects.
    Having uncovered the sum (3) using (4) and grouped the terms, we can check
using derivation, that for each fixed collection {M1 , . . . , MJ } of sets, the values wj =
y(Mj ), j = 1, . . . , J, provide the points where the objective function (3) attains its
minimum. If we put these values in the formulae (3), (4) and let Cj = {yi ∈ Y | i ∈ Mj },
then we can simply verify (see also [5]) that the model (3), (4) induces Problem 1.
    A similar to Problem 1 is well-known strongly NP-hard [6] problem MSSC (Min-
imum Sum-of-Squares Clustering) of clustering, which is also known as k-means [7],
[8], [9]. In this problem the objective function is the same as in Problem 1, but the
cardinalities of the required clusters are optimizable variables and instead of searching
a family of subsets which union could not cover all input set, we have to find a partition
of this set.
    Notice, that the numbers from the set N \(∪Jj=1 Mj ) in the model (3), (4) correspond
to the elements from the set Y\(∪Jj=1 Cj ) in Problem 1. Elements of these sets do not
appear in the formulae for the estimation of the values wj , j = 1, . . . , J, and the
corresponding centroids y(Cj ), j = 1, . . . , J, in Problem 1. Therefore, Problem 1 could
be considered as a problem of Data Censoring [10]. In this applied problem, a part
of the data (generally, unknown part) of obtained experimental results in a situation
of random failure of measurement instument has to be excluded from the estimation
procedure, because this data distorts final estimations.


2   Known and obtained results

The strong NP-hardness of Problem 1 is implied from the results obtained in [11],
since in the cited work it was proved that the special case of Problem 1 when J = 1 is
NP-hard in the strong sense problem.
    The Problem 1 is referred to the algorithmically poorly-studied problems of the dis-
crete optimization. In [5] for this problem, was proposed a 2-approximation algorithm
which time complexity is equal to O(N 2 (N J+1 + q)). For the case of Problem 1 when
the number J of required subsets is fixed (not the input of the problem), this algorithm
is polynomial. Currently, there are no other algorithmic results for Problem 1 and the
known results [12–15] were obtained only for its special case when J = 1. These results
are described below.
504     A. Galashov, A. Kel’manov

    For Problem 1 in [12], a 2-approximation polynomial-time algorithm of complexity
O(qN 2 ) was constructed.
    For the variation of Problem 1 with an additional restriction that the coordinates
of the input points are integer and for the case of fixed space dimension, in [13] an
exact pseudopolynomial algorithm was presented, which time complexity is equal to
O(N (M B)q ), where B is the maximum absolute value of the coordinates of the input
points.
    Furthermore, for the case of the fixed space dimension in [14] an FPTAS was pro-
posed. This scheme for a given relative error ε finds (1 + ε)-approximate solution in
O(N 2 (M/ε)q ) time, that is polynomial in the size of input and 1/ε.
    A PTAS of complexity O(qN 2/ε+1 (9/ε)3/ε ), where ε is a guaranteed relative error,
was found in [15].
    In the current work, for the variation of Problem 1 with an additional restriction
that the coordinates of the input points are integer, an algorithm which finds an exact
solution in O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ) time is constructed, where B is
the maximum absolute value of the coordinates of the input points and M is the least
common multiple for the numbers M1 , . . . , MJ . In the case of the fixed dimension q
of the space and of the fixed number J of required subsets, the proposed algorithm is
pseudopolynomial and its time complexity is bounded by O(N 3 (M B)qJ ).


3     Fundamentals of algorithm
In order to justify the properties of the algorithm, we need a few basic statements, an
auxiliary problem and an exact polynomial-time algorithm finding its solution.
    The following statements are the geometrical basis of the algorithm.

Lemma 1. Let all conditions of Problem 1 be satisfied and Y ⊂ Zq . Moreover, let

                                  B = max        max      |(y)i |                         (5)
                                        y∈Y i∈{1,...,q}


be the maximum absolute value of the coordinates of the input points, where (y)i is the
i-th coordinate of the point y. Then, y(Cj ) ∈ D, j = 1, . . . , J, where

                                1
         D = {z ∈ Rq | (z)k =     (v)k , (v)k ∈ Zq , |(v)k | ≤ M B, k = 1, . . . , q} ,   (6)
                                M
where M is the least common multiple for the numbers M1 , . . . , MJ .

Proof. According to the definition of the geometrical center (as the mean over the set),
we have
                         1 ∑ k            p
            (y(Cj ))k =          (y) =      , j = 1, . . . , J, k = 1, . . . , q ,
                        |Cj |            Mj
                             y∈Cj

where p is an integer such that
                                     ∑
                           |p| = |          (y)k | ≤ Mj B ≤ M B .
                                     y∈Cj
                An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding             505

    By the definition of the least common multiple M , for each j = 1, . . . , J, there
exists an integer s(j) such that 0 < s(j) ≤ M , and M = s(j)Mj . Therefore,

                           p     s(j)p   s(j)p
                              =        =       , j = 1, . . . , J .
                           Mj   s(j)Mj    M

To conclude, notice that s(j)p ∈ Z and

                                    |s(j)p|
                                            ≤ Bs(j) ≤ BM .
                                      M
                                                                                               ⊔
                                                                                               ⊓

   The set D is the multi-dimensional grid with the rational step and with a center in
the origin. For the cardinality of this grid the following obvious equality holds

                                       |D| = (2M B + 1)q .                                   (7)

Lemma  ∑2. For each non-empty finite set Z of
                                           ∑ points from R , the minimum over x
                                                          q

of the   ∥z − x∥ is attained at x = z = |Z| z∈Z z.
                2                        1
       z∈Z

It could be easily checked by the derivation.
    The calculation basis of the algorithm is given by an exact polynomial-time algo-
rithm finding the solution of the following auxiliary problem.

Problem 2. Given a set Y = {y1 , . . . , yN }, a tuple b = (b1 , . . . , bJ ) of points from Rq
and some positive integers M1 , . . . , MJ . Find a family {B1 , . . . , BJ } of disjoint subsets
of the set Y such that
                                                 J ∑
                                                 ∑
                       Gb (B1 , . . . , BJ ) =              ∥y − bj ∥2 → min ,               (8)
                                                 j=1 y∈Bj

under constraints |Bj | = Mj , j = 1, . . . , J, and (2) on the cardinalities of the required
subsets.

    The Problem 2 can be reduced [5] to the known assignment problem (see for example
[16]) and its exact solution can be found [5] in O(N (N 2 + qJ)) time. In the following,
the algorithm solving Problem 2 is noted by A1 .
    The following technical lemma lies in the fundamentals of the alorithm and is given
for the completeness of the description.

Lemma
   ∑J 3. The least common multiple for the numbers M1 , . . . , MJ , can be found in
O(J j=1 log2 Mj ) time.

Proof. Let M = lcm(M1 , . . . , MJ ) be the least common multiple for the numbers
M1 , . . . , MJ . Using well known properties of the least common multiple, we have the
following equalities.
                                                   Mi Mj
                                lcm(Mi , Mj ) =               ,                     (9)
                                                gcd(Mi , Mj )
506        A. Galashov, A. Kel’manov

where gcd(Mi , Mj ) is the greatest common divisor for the numbers Mi , Mj .

                     lcm(M1 , . . . , MJ ) = lcm(lcm(M1 , . . . , MJ−1 ), MJ ) .                (10)

    It’s known that the greatest common divisor for the numbers a, b can be found by
the Euclidean algorithm in O(log 2 (max(a, b))) time.
    From (9) and (10), the obvious recurrent algorithm of computing the least common
multiple for the numbers M1 , . . . , MJ , is implied. For the complexity of computing the
least common multiple for the numbers M1 , . . . , MJ , we have

    ∑
    J−1                                                     ∑
                                                            J         ∏
                                                                     J−i+1
          (log(max(MJ−1+1 , lcm(M1 , . . . , MJ−i )))2 ≤      (log2 (      Mk ))
    i=1                                                      i=1         k=1
                                                                                  ∑
                                                                                  J
                                                                         = O(J(       log2 Mk )) ,
                                                                               k=1

where we used the following obvious upper bound on the least common multiple

                                                            ∏
                                                            J−i
                                lcm(M1 , . . . , MJ−i ) ≤         Mk .
                                                            k=1

                                                                                                     ⊔
                                                                                                     ⊓


4     Exact pseudopolynomial algorithm
The idea of the proposed algorithm is as follows. In the region of the input space,
defined by the maximum absolute value of the coordinates of the input points, a multi-
dimensional grid is constructed which is uniform over each coordinate and has the
rational step. By construction, all centroids including optimals belong to the grid.
    For each tuple of J points from the constructed grid, the auxiliary Problem 2 is
solved. Its solution — a family of disjoint subsets — is included in the collection of the
candidates on the solution for Problem 1.
    The family of the subsets with minimum value of the objective function of Problem
2 is taken as a solution of Problem 1.
    Let us formulate an algorithm which implements the described approach.
    A l g o r i t h m A.
    Input: Set Y and positive integers M1 , . . . , MJ .
    Step 1. Find the least common multiple M for the numbers M1 , . . . , MJ using (9)
and (10), and the value B using formula (5). Construct the grid D using formula (6).
    Step 2. For every tuple d = (d1 , . . . , dJ ) ∈ DJ , using Algorithm A1 , find and
memorize the exact solution {B1 (d), . . . , BJ (d)} of the auxiliary Problem 2 and the
value of the objective function Gd , putting b = d in the formula (8).
    Step 3. In the collection {B1 (d), . . . , BJ (d)}, d ∈ DJ , of the solutions found on
step 2, find the tuple dA = (dA                  A                                    d
                                    1 , . . . , dJ ) on which the objective function G attains its
minimum. As the solution {C1A , . . . , CJA } of Problem 1, we take constructed on step 2
subsets B1 (dA ), . . . , BJ (dA ) corresponding to the tuple dA .
                    An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding                              507

   Output: Collection {C1A , . . . , CJA }.
   Next statement specifies the properties of the proposed algorithm.

Theorem 1. Let in the conditions of Problem 1, the coordinates of all points from the
set Y have integer values in the interval [−B, B]. Then, algorithm A finds an optimal
solution of this problem in O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ) time.

Proof. Let the subsets C1∗ , . . . , CJ∗ be the optimal solution of Problem 1. According to
the definition of step 3, algorithm A finds the solution of Problem 1 in the form:

                                 {C1A , . . . , CJA } = {B1 (dA ), . . . , BJ (dA )} ,                           (11)

where
                                  dA = arg min Gd (B1 (d), . . . , BJ (d)) .                                     (12)
                                                 d∈D J

    Thus, noticing that the elements of the optimal tuple y ∗ = (y(C1∗ ), . . . , y(Cj∗ )) be-
long to the set D, according to Lemma 1, from the definitions (8), (12) and (1), the
following is implied
                                   ∑J           ∑
                                                       ∥y − y(Cj∗ )∥2 = F (C1∗ , . . . , CJ∗ ) .
                             A
                        Gd ≤                                                                                     (13)
                                         j=1
                                               y∈Cj∗

Then, using Lemma 2, definitions (1), (8), (11), we obtain
                             ∑J        ∑
  F (C1A , . . . , CJA ) =                     ∥y − y(CjA )∥2
                                 j=1
                                       y∈CjA
                             ∑J         ∑                         ∑J            ∑                          A
                        ≤                      ∥y − dA
                                                     j ∥ =
                                                        2
                                                                                           ∥y − dA
                                                                                                 j ∥ =G
                                                                                                    2  d
                                                                                                               . (14)
                                 j=1                                   j=1
                                       y∈CjA                                 y∈Bj (dA )


Combining (13) and (14), we find the estimation

                                       F (C1A , . . . , CJA ) ≤ F (C1∗ , . . . , CJ∗ ) .

   On the other hand, the sets C1A , . . . , CJA are the feasible solution of Problem 1. Thus,
we have the estimation

                                       F (C1∗ , . . . , CJ∗ ) ≤ F (C1A , . . . , CJA ) .

From the obtained estimations, we have the equiality

                                       F (C1A , . . . , CJA ) = F (C1∗ , . . . , CJ∗ ) .

   Let us estimate the time complexity of the algorithm using its stepwise notation.
   By Lemma 3, computational costs of computing
                                             ∑J         the least common multiple for
the numbers M1 , . . . , MJ on step 1 are O(J k=1 log2 Mk ), which is not greater than
O(J 2 log2 N ). On this step, computational costs of calculation of the values B and D
are bounded by O(qN ) and O(1) respectively.
508     A. Galashov, A. Kel’manov

    The time complexity required to find an exact solution for the auxiliary Problem 2
(see Section 4) is equal to O(N (N 2 + qJ)). On step 2, this problem is solved O(|D|J )
times. Therefore, according to (7), the complexity of step 2 is equal to O(N (N 2 +
qJ)(2M B + 1)qJ ).
    Summing up all the costs on all the steps, we find that the time complexity of the
algorithm is bounded by O(N (N 2 + qJ)(2M B + 1)qJ + J 2 log 2 N ).                  ⊔
                                                                                     ⊓

Remark 1. If the dimension q of the space and the number J of desired subsets are
fixed, the coordinates of the input points from Y are integer and belong to an interval
[−B, B], then algorithm A is an exact pseudopolynomial algorithm solving Problem 1
in O(N 3 (M B)qJ ) time.


5     Conclusion

In the paper, we have considered the strongly NP-hard problem of finding a family of
disjoint subsets in a finite set of points from Euclidean space.
    For the special case of the problem when the input points coordinates are integer,
we have constructed an exact algorithm. This algorithm is a pseudopolynomial one
when the input space dimension and the number of required subsets are fixed.
    Of considerable interest is justification of faster approximation polynomial-time
algorithms with guaranteed accuracy for general problem.


Acknowledgments. This work was supported by the Russian Foundation for Basic
Research (project nos. 15-01-00462, and 16-07-00168) and by the grant of Presidium
of RAS (program 5, project 227).


References

 1. Hastie T., Tibshirani R., Friedman J.: The Elements of Statistical Learning: Data Mining,
    Inference, and Prediction. Springer-Verlag, New York (2009)
 2. James G., Witten D., Hastie T., Tibshirani R.: An Introduction to Statistical Learning.
    Springer Science+Business Media, LLC, New York (2013)
 3. Bishop C.M.: Pattern Recognition and Machine Learning. Springer Science+Business Me-
    dia, LLC, New York (2006)
 4. Flach P.: Machine Learning: The Art and Science of Algorithms that Make Sense of Data.
    Cambridge University Press, New York (2012)
 5. Galashov A.E., Kelmanov A.V.: A 2-Approximate Algorithm to Solve One Problem of the
    Family of Disjoint Vector Subsets. J. Automation and Remote Control. 75(4), 595–606
    (2014)
 6. Aloise D., Deshpande A., Hansen P., Popat P.: NP-hardness of Euclidean sum-of-squares
    clustering. J. Machine Learning. 75(2), 245–248 (2009)
 7. Jain A.K.: Data Clustering: 50 Years Beyond k-Means. J. Pattern Recognition Lett. 31,
    651–666 (2010)
 8. Edwards A.W.F., Cavalli-Sforza L.L.: A Method for Cluster Analysis. J. Biometrics. 21,
    362–375, (1965)
                An Exact Pseudopolynomial Algorithm for Disjoint Subsets Finding         509

 9. MacQueen J.B.: Some Methods for Classification and Analysis of Multivariate Observa-
    tions. In: Fith Berkley Symposium on Mathematical Statistics and Probability, vol. 1, pp.
    281 – 297. University of California Press, Berkley, California (1967)
10. Bagdonavicius, V.,Kruopis, J., Nikulin, M.S.: Non-parametric Tests for Censored Data.
    ISTE/WILEY, London, (2011)
11. Kel’manov A.V., Pyatkin A.V.: NP-Completeness of Some Problems of Choosing a Vector
    Subset. J. Applied and Industrial Mathematics. 5(3), 352–357, (2011)
12. Kel’manov A.V., Romanchenko S.M.: An Approximation Algorithm for Solving a Problem
    of Search for a Vector Subset. J. Applied and Industrial Mathematics. 6(1), 90–96, (2012)
13. Kel’manov A.V., Romanchenko S.M.: Pseudopolynomial Algorithms for Certain Compu-
    tationally Hard Vector Subset and Cluster Analysis Problems. J. Automation and Remote
    Control. 73(2), 349–354, (2012)
14. Kel’manov A.V., Romanchenko S.M.: An FPTAS for a Vector Subset Search Problem. J.
    Applied and Industrial Mathematics. 8(3), 329–336, (2014)
15. Schenmaier V.V.: An approximation scheme for a problem of search for a vector subset.
    J. Applied and Industrial Mathematics. 6(3), 381–386, (2012)
16. Papadimitriou C. H., Steiglitz K.: Combinatorial Optimization: Algorithms and Complex-
    ity. Prentice-Hall, Englewood Cliﬀs, New Jersey (1982)