=Paper= {{Paper |id=Vol-1987/paper43 |storemode=property |title=Algorithms with Performance Guarantee for Some Quadratic Euclidean Problems of 2-Partitioning a Set and a Sequence |pdfUrl=https://ceur-ws.org/Vol-1987/paper43.pdf |volume=Vol-1987 |authors=Alexander Kel'manov,Vladimir Khandeev }} ==Algorithms with Performance Guarantee for Some Quadratic Euclidean Problems of 2-Partitioning a Set and a Sequence== https://ceur-ws.org/Vol-1987/paper43.pdf
     Algorithms with Performance Guarantee for Some
    Quadratic Euclidean Problems of 2-Partitioning a Set
                      and a Sequence

                   Alexander Kel’manov                                  Vladimir Khandeev
              Sobolev Institute of Mathematics                    Sobolev Institute of Mathematics
                  Acad. Koptyug avenue 4,                             Acad. Koptyug avenue 4,
                Novosibirsk State University                        Novosibirsk State University
                      Pirogova str. 1,                                    Pirogova str. 1,
                 630090 Novosibirsk, Russia                          630090 Novosibirsk, Russia
                     kelm@math.nsc.ru                                  khandeev@math.nsc.ru




                                                        Abstract
                       We consider the problems of 2-partitioning a finite set and a finite
                       sequence of points in Euclidean space into clusters (subsets or sub-
                       sequences) minimizing the sum of squared distances between cluster
                       elements and the corresponding cluster centers. It is assumed that the
                       center of one of the desired clusters is the origin, while the center of
                       the other cluster is unknown and determined as the mean value over
                       cluster elements. In this work, we present a short survey on some re-
                       sults for these problems and a new result: a randomized algorithm for
                       the sequence partitioning problem.




1    Introduction
In this work, we consider two strongly NP-hard discrete optimization problems. The goal of our work is to
present a short survey on some recent results for these problems.
   The problems under consideration are closely related to the well-known strongly NP-hard 2-MSSC (Minimum
Sum-of-Squares Clustering) problem [Aloise et al., 2009]. This problem (also known as 2-Means) is to find a
partition of a finite set Y of points from a Euclidean space into two subsets (clusters) C and Y \ C such that
                                 ∑                  ∑
                                     ∥y − y(C)∥2 +       ∥y − y(Y \ C)∥2 → min,
                                   y∈C                  y∈Y\C

                   ∑                           ∑
                     y∈C y and y(Y \ C) = |C|     y∈Y\C y are the centroids (the geometric centers) of sets C and
                1                            1
where y(C) = |C|
Y \ C, respectively.
   The example of an input set of points on a plane is presented at Fig. 1. One needs to find a partition of this
set into two clusters and find their centroids. It is assumed that the points are scattered around these centroids
(the points of different clusters are colored differently).

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of
the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org




                                                            298
                                       Figure 1: Example of the input set

   The main difference of the considered problems from the 2-MSSC problem is that the center of one the
clusters is fixed at a certain point in the Euclidean space (see the following sections); without loss of generality
we can assume that this point is the origin. Additionally, in one of the problems under consideration the input
is a sequence rather than a set, and there are constraints on the indices of elements of subsequences that are
included in one cluster or another. Despite the similarities, the considered problems and the 2-MSSC problem
are not equivalent. Therefore, the both considered problems need an individual study. Also, the problems under
consideration are important for applications, e.g. for approximation theory, statistics, data mining, the remote
sensing, geophysics, biometrics, medicine and technical diagnostics, processing the speech signals, electronic
intelligence, radiolocation, hydroacoustics, etc.

2   Partitioning a Set
Problem 1.
  Given: a set Y = {y1 , . . . , yN } of points from Rq and a positive integer number M .
  Find: a subset C ⊆ Y that minimizes the objective function
                                                  ∑                 ∑
                                          S(C) =     ∥y − y(C)∥2 +       ∥y∥2 ,
                                              y∈C                y∈Y\C

                   ∑
                    y∈C y is the centroid (the geometric center) of set C, under constrain |C| = M .
              1
where y(C) = |C|




                                       Figure 2: Example of the input set




                                                       299
   The example of an input set of points on a plane is presented at Fig. 2. One needs to find a partition of this
set into two clusters. Unlike 2-MSSC problem, in Problem 1 one needs to find the centroid of only one cluster.
It is assumed that the points of this cluster (shown by dark color) are scattered around its unknown centroid
and that the points of the second one (shown by white color) are scattered around the origin.
   The strong NP-hardness of Problem 1 is shown in [Kel’manov & Pyatkin, 2009]. In the case of fixed space
dimension the problem is solvable in polynomial time [Gimadi et al., 2010, Shenmaier, 2016].
   A 2-approximation algorithm for Problem 1 is presented in [Dolgushev & Kel’manov, 2011]. The running time
of the algorithm is O(qN 2 ). A polynomial-time approximation scheme for Problem 1 with a O(qN 2/ε+1 (9/ε)3/ε )-
time complexity, where ε is an arbitrary relative error, is substantiated in [Dolgushev et al., 2015].
   In [Kel’manov & Khandeev, 2015a], a randomized algorithm for Problem 1 is presented. For an established
parameter value, the algorithm finds an approximate solution of the problem in polynomial time for given
values of the relative error and failure probability; the conditions are established under which the algorithm is
asymptotically exact and runs in polynomial time.
   Algorithm A1 (randomized algorithm).
   Input: Y, M , and a positive integer parameter k.
   Step 1. Generate a multiset T by independently and randomly choosing k elements one after another (with
replacement) from Y.
   Step 2. For each nonempty H ⊆ T , compute the centroid y(H) and form a subset BM (y(H)) consisting
of the M points from Y, having the largest projections onto the direction specified by this centroid. Compute
S(BM (y(H))).
   Step 3. In the family of solutions found at Step 2, choose the subset CA = BM (y(H)) for which S(BM (y(H)))
is minimal. If there are several solutions with the same minimal value, take an arbitrary of them.
   Output: CA .
   Theorem 1. For arbitrary δ ∈ (0, 1) and positive integer t ≤ k, algorithm A1 finds a (1 + δt                1
                                                                                                                )-
approximate solution of Problem 1 in O(2 q(k + N )) time with probability not less than 1 − (δ + α), where
                                             k
     ∑t−1 ( )( )i (          )k−i
α = i=0 ki M    N     1− M N      .
   Corollary 1. Assume that M ≥ βN , where β ∈ (0, 1) is a constant. Then, given ε > 0 and γ ∈ (0, 1) for
the fixed parameter k = max(⌈ β2 ⌈ γε
                                   2
                                     ⌉⌉, ⌈ β8 ln γ2 ⌉) algorithm A1 finds a (1 + ε)-approximate solution of Problem 1
with probability not less than 1 − γ in O(qN ) time.
   Below are the conditions for algorithm A1 to be asymptotically exact.
   Theorem 2. Under the conditions of Theorem 1, let k = ⌈log2 N ⌉, δ = (log2 N )−1/2 , t = ⌈ kM  2N ⌉. Assume that
M ≥ βN , where β ∈ (0, 1) is a constant. Then algorithm A1 finds a (1 + εN )-approximate solution of Problem 1
                                                                                                     β
with probability 1 − γN in O(qN 2 ) time, where εN ≤ β2 (log2 N )−1/2 , γN ≤ (log2 N )−1/2 + N − 8 ln 2 .
   A pseudopolynomial algorithm which finds an optimal solution in the case of integer components of the points
in the input set and fixed space dimension is proposed in [Kel’manov & Khandeev, 2015b].
   Algorithm A2 (exact pseudopolynomial algorithm).
   Input: Y, M .
   Step 1. Given x ∈ G, where G is a multidimensional cubic uniform in each coordinate grid of size 2D with the
          1
distance M  between the nodes and the center at the origin, and D is the maximal modulus of the coordinates
of input points, construct the subset BM (x) consisting of the   ∑ M points from Y, having
                                                                                       ∑      the largest projections
onto the direction specified by x. Calculate Q(BM (x), x) = y∈BM (x) ∥y − x∥2 + y∈Y\BM (x) ∥y∥2 .
   Step 2. Find the point xA = arg min Q(BM (x), x) (if there are several points with the same minimal value,
                                      x∈G
take an arbitrary of them) and the corresponding subset BM (xA ). As a solution to the problem, take CA =
BM (xA ).
   Output: CA .
   Theorem 3. Suppose that, under the conditions of Problem 1, the points of Y have integer components lying
in [−D, D]. Then algorithm A2 finds an optimal solution in O(qN (2M D + 1)q ) time.
   In the case of fixed space dimension q the algorithm is pseudopolynomial, since under this condition, the
running time of the algorithm is estimated by O(N (M D)q ).
   In [Kel’manov & Khandeev, 2016], it was proved that, unless P=NP, in the general case of Problem 1 there
is no fully polynomial-time approximation scheme (FPTAS). In addition, such a scheme was presented for the
case of fixed space dimension.
   Algorithm A3 (approximation scheme).




                                                        300
   Input: Y, M and ε > 0.
   For each point y ∈ Y Steps 1–5 are executed.
   Step 1. Construct the set BM (y) of the M points from Y, having the largest projections onto the direction
specified by y.                            √                         √
                                                 2ε                          1
    Step 2. Compute S(BM (y)), h(y, ε) =        qM S(BM (y)) and H(y) =      M S(BM (y)).
   Step 3. If S(BM (y)) = 0, then return the set BM (y) as a result CA produced by the algorithm and exit;
otherwise, go to the next step.
   Step 4. Construct a multidimensional cubic uniform in each coordinate grid G(y, h, H + h/2) of size 2H + h
with the distance h between the nodes and the center at the point y.
   Step 5. For each node x of grid G(y, h, H + h/2), construct the set BM (x) of the M points from Y, having
the largest projections onto the direction specified by x, and compute S(BM (x)).
   Step 6. In the family {BM (x) | x ∈ G(y, h, H + h/2), y ∈ Y} of sets, choose, as a solution CA , the set BM (x)
for which S(BM (x)) is minimal. If there are several solutions with the same minimal value, take an arbitrary of
them.
   Output: CA .
   Theorem
        √      4. For any fixed ε > 0 algorithm A3 finds a (1 + ε)-approximate solution of Problem 1 in
O(qN 2 ( 2q      q
          ε + 2) ) time.
  In the case of fixed space dimension q the algorithm implements an FPTAS, since under this condition, the
running time of the algorithm is estimated by O(N 2 (1/ε)q/2 ).

3    Partitioning a Sequence
Problem 2.
  Given: a sequence Y = (y1 , . . . , yN ) of points from Rq , and some positive integer numbers Tmin , Tmax and M .
  Find: a subset M = {n1 , . . . , nM } ⊆ N = {1, . . . , N } such that
                                            ∑                       ∑
                              F (M) =           ∥yj − y(M)∥2 +          ∥yi ∥2 → min,
                                          j∈M                  i∈N \M

              1
                    ∑
where y(M) = |M|        i∈M yi , under constraints

                                1 ≤ Tmin ≤ nm − nm−1 ≤ Tmax ≤ N, m = 2, . . . , M,                              (1)

on the elements of (n1 , . . . , nM ).
  In this problem, (1) define the constraints on the elements of the cluster {yi | i ∈ M} with unknown centroid.
  In [Kel’manov & Pyatkin, 2013], the variant of Problem 2 was under study with Tmin and Tmax as parameters,
and it was established that Problem 2 is strongly NP-hard for every Tmin < Tmax . In the trivial case when
Tmin = Tmax , this problem is solvable in polynomial time.
  A 2-approximation polynomial algorithm with complexity O(N 2 (M N + q)) was proposed in
[Kel’manov & Khamidullin, 2014].
  For the case where the components of the points are integers and the dimension q is fixed, the exact pseu-
dopolynomial algorithm was introduced in [Kel’manov et all., 2017a].
  Algorithm A4 (exact pseudopolynomial algorithm).
  Input: Y, Tmin , Tmax , M .
  Step 1. For each point x ∈ G, where G is a multidimensional cubic uniform in each coordinate grid of size
                        1
2D with the distance M       between the nodes and the center at the origin, and D is the maximal modulus of the
coordinates of input points, find the optimal solution Mx of the problem
                                                 ∑
                                       Gx (M) =      (2⟨yn , x⟩ − ∥x∥2 ) → max                               (2)
                                                 n∈M

under constraints (1) on the elements of M. Compute Gx (Mx ).
  Step 2. Find the point xA = arg max Gx (Mx ) (if there are several points with the same maximal value, take
                                       x∈G
an arbitrary of them) and the corresponding subset MA = MxA .
  Output: MA .




                                                        301
   At Step 1 the optimal solution of the problem (2) is found in O(N (M (Tmax − Tmin + 1) + q)) time using the
dynamic programming scheme [Kel’manov & Khamidullin, 2014].
   Theorem 5. Suppose that, under the conditions of Problem 2, the points of Y have integer components lying
in [−D, D]. Then algorithm A4 finds an optimal solution in O(N (M (Tmax − Tmin + 1) + q)(2M D + 1)q ) time.
   In the case of fixed space dimension q the algorithm is pseudopolynomial, since under this condition, the
running time of the algorithm is estimated by O(N 3 (M D)q ).
   In [Kel’manov et al., 2016], an approximation algorithm was proposed which for the fixed dimension q of the
space implements the FPTAS scheme.
   Algorithm A5 (approximation scheme).
   Input: Y, Tmin , Tmax , M and ε > 0.
   For each point y ∈ Y Steps 1–5 are executed.
                                                           √ solution M of problem (2) for x = y.
                                                                         y
   Step 1. Find [Kel’manov & Khamidullin, 2014] an optimal
                                    √
                                      2ε                     1
   Step 2. Compute F (My ), h = qM       F (My ) and H = M     F (My ).
   Step 3. If F (My ) = 0, then return the set My as a result MA produced by the algorithm and exit; otherwise,
go to the next step.
   Step 4. Construct a multidimensional cubic uniform in each coordinate grid G(y, h, H + h/2) of size 2H + h
with the distance h between the nodes and the center at the point y.
   Step 5. For each node x of grid G(y, h, H + h/2) construct [Kel’manov & Khamidullin, 2014] an optimal
solution Mx of problem (2) and compute F (Mx ).
   Step 6. In the family {Mx | x ∈ G(y, h, H + h/2), y ∈ Y} of sets, choose, as a solution MA , the set Mx , for
which F (Mx ) is minimal. If there are several solutions with the same minimal value, take an arbitrary of them.
   Output: MA .
   Theorem 6. For any fixed √      ε > 0 algorithm A5 finds a (1 + ε)-approximate solution of Problem 2 in
O(N 2 (M (Tmax − Tmin + 1) + q)( 2q         q
                                     ε + 2) ) time.
   In the case of fixed space dimension q the algorithm implements an FPTAS, since under this condition, the
running time of the algorithm is estimated by O(M N 3 (1/ε)q/2 ).
   In [Kel’manov et al., 2017b], a randomized algorithm for Problem 2 is presented.
   Algorithm A6 (randomized algorithm).
   Input: Y, Tmin , Tmax , M , and a positive integer parameter k.
   Step 1. Generate a multiset T by independently and randomly choosing k elements one after another (with
replacement) from Y.
   Step 2.        For each nonempty H              ⊆     T , compute the centroid y(H) and construct
[Kel’manov & Khamidullin, 2014] an optimal solution Mx of problem (2) for x = y(H).
   Step 3. In the family of solutions found at Step 2, of sets, choose a set MA = Mx for which F (Mx ) is
minimal. If there are several solutions with the same minimal value, take an arbitrary of them.
   Output: MA .
   Theorem 7. For arbitrary δ ∈ (0, 1) and positive integer t ≤ k, algorithm A6 finds a (1 + δt   1
                                                                                                   )-approximate
solution of Problem 2 in O(2 (qk + N (M (Tmax − Tmin + 1) + q))) time with probability not less than 1 − (δ + α),
                              k
            ∑t−1 ( )( )i (         )k−i
where α = i=0 ki M    N      1 − M
                                 N      .
   Corollary 2. Assume that M ≥ βN , where β ∈ (0, 1) is a constant. Then, given ε > 0 and γ ∈ (0, 1) for
the fixed parameter k = max(⌈ β2 ⌈ γε
                                  2
                                     ⌉⌉, ⌈ β8 ln γ2 ⌉) algorithm A6 finds a (1 + ε)-approximate solution of Problem 2
with probability not less than 1 − γ in O(qM N 2 ) time.
   Theorem 8. Under the conditions of Theorem 7, let k = ⌈log2 N ⌉, δ = (log2 N )−1/2 , t = ⌈ kM       2N ⌉. Assume
that M ≥ βN , where β ∈ (0, 1) is a constant. Then algorithm A6 finds a (1 + εN )-approximate solution of
Problem 2 with probability 1 − γN in O(qM N 2 (Tmax − Tmin + 1)) time, where εN ≤ β2 (log2 N )−1/2 , γN ≤
                     β
(log2 N )−1/2 + N − 8 ln 2 .
   Since Tmax − Tmin + 1 ≤ N , under the specified conditions the running time of the algorithm is estimated by
O(qM N 3 ).

4   Conclusion
In the paper we present a short survey on some results for two problems of 2-partitioning a finite set and a
finite sequence of points in Euclidean space into clusters. Also we present a new randomized algorithm for the




                                                        302
sequence partitioning problem. For an established parameter value, the algorithm finds an approximate solution
in polynomial time for fixed values of the relative error and failure probability. The conditions are found under
which the algorithm is asymptotically exact and has polynomial-time complexity.
   The issue of great interest is the substantiation of faster randomized algorithms for the problems and the
search for subclasses of the problems for which linear time and sub-linear time randomized algorithms can be
designed.

Acknowledgements
This work was supported by Russian Foundation for Basic Research, Projects 15-01-00462, 16-31-00186, 16-07-
00168.

References
[Aloise et al., 2009] Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-
          of-squares clustering. Mach. Learn., 75(2), 245–248.
[Kel’manov & Pyatkin, 2009] Kel’manov, A. V., & Pyatkin, A.V. (2009). Complexity of certain problems of
        searching for subsets of vectors and cluster analysis. Comput. Math. and Math. Phys., 49(11), 1966–
        1971.
[Gimadi et al., 2010] Gimadi, E. Kh., Pyatkin, A. V., & Rykov, I. A. (2010). On polynomial solvability of some
         problems of a vector subset choice in a Euclidean space of fixed dimension. J. Appl. Ind. Math., 4(1),
         48–53.
[Shenmaier, 2016] Shenmaier, V. V. (2016). Solving some vector subset problems by Voronoi diagrams. J. Appl.
        Ind. Math., 10(4), 560–566.
[Dolgushev & Kel’manov, 2011] Dolgushev, A. V., & Kel’manov, A. V. (2011). An approximation algorithm for
         solving a problem of cluster analysis. J. Appl. Ind. Math., 5(4), 551–558.
[Dolgushev et al., 2015] Dolgushev, A. V., Kel’manov, A. V., & Shenmaier, V. V. (2015). Polynomial-time ap-
         proximation scheme for a problem of partitioning a finite set into two clusters (in Russian). Tr. Inst.
         Mat. Mekh., 21(3), 100–109.
[Kel’manov & Khandeev, 2015a] Kel’manov, A. V., & Khandeev, V. I. (2015). A randomized algorithm for two-
        cluster partition of a set of vectors. Comput. Math. Math. Phys., 55(2), 330–339.
[Kel’manov & Khandeev, 2015b] Kel’manov, A. V., & Khandeev, V. I. (2015). An exact pseudopolynomial al-
        gorithm for a problem of the two-cluster partitioning of a set of vectors. J. Appl. Ind. Math., 9(4),
        497–502.
[Kel’manov & Khandeev, 2016] Kel’manov, A. V., & Khandeev, V. I. (2016). Fully polynomial-time approxima-
        tion scheme for a special case of a quadratic euclidean 2-clustering problem. J. Appl. Ind. Math., 56(2),
        334–341.
[Kel’manov & Pyatkin, 2013] Kel’manov A. V., & Pyatkin A. V. (2013). On Complexity of Some Problems of
        Cluster Analysis of Vector Sequences. J. Appl. Indust. Math., 7(3), 363–369.
[Kel’manov & Khamidullin, 2014] Kel’manov A. V., & Khamidullin S. A. (2014). An Approximating Polynomial
        Algorithm for a Sequence Partitioning Problem. J. Appl. Indust. Math., 8(2), 236–244.
[Kel’manov et all., 2017a] Kel’manov A.V., Khamidullin S.A., & Khandeev V.I. (2017). Exact Pseudopolynomial
        Algorithm for one Sequence Partitioning Problem. Automat. Remote Control, 78(1), 66–73.
[Kel’manov et al., 2016] Kel’manov A.V., Khamidullin S.A., & Khandeev V.I. (2016). A Fully Polynomial-Time
        Approximation Scheme for a Sequence 2-Cluster Partitioning Problem. J. Appl. Ind. Math., 10(2),
        209–219.
[Kel’manov et al., 2017b] Kel’manov A.V., Khamidullin S.A., & Khandeev V.I. (2017). A randomized algorithm
        for a sequence 2-clustering problem (in Russian). Comput. Math. and Math. Phys. (accepted).




                                                      303