Algorithms with Performance Guarantee for a Weighted
                 2-partition Problem

                   Alexander Kel’manov                                     Anna Motkova
              Sobolev Institute of Mathematics                    Sobolev Institute of Mathematics
                  Acad. Koptyug avenue 4,                             Acad. Koptyug avenue 4,
                Novosibirsk State University                        Novosibirsk State University
                      Pirogova str. 1,                                     Pirogova str. 1,
                 630090 Novosibirsk, Russia                          630090 Novosibirsk, Russia
                     kelm@math.nsc.ru                                     anitamo@mail.ru


                                                        Abstract
                       We consider the problem of partitioning a ﬁnite set of Euclidean points
                       into two clusters minimizing the sum over both clusters the weighted
                       sums of the squared intracluster distances from the elements of the
                       clusters to their centers. The center of one of the clusters is unknown
                       and determined as the average value over all points in the cluster, while
                       the center of the other cluster is the origin. The weight factors for both
                       intracluster sums are the cardinalities of the corresponding clusters. In
                       this work, we present a short survey on the results for this problem and
                       a new result: a 2-approximation algorithm.


1    Introduction
In this work, we consider a strongly NP-hard problem of discrete optimization. The goal of our work is to present
a short survey on the results for these problems.
   The problem under study is closely related to the well-known strongly NP-hard
   Weighted variance-based 2-clustering Problem
   Given a set Y = {y1 , . . . , yN } of points from Rq . Find a partition of Y into two non-empty clusters C and
Y \ C such that                  ∑                         ∑
                           |C|       ∥y − y(C)∥2 + |Y \ C|      ∥y − y(Y \ C)∥2 −→ min ,
                                y∈C                         y∈Y\C
                   ∑                              ∑
where y(C) = |C| y∈C y and y(Y \ C) = |C| y∈Y\C y are the geometric centers (centroids) of both clusters C
                 1                              1

and Y \ C.
   The main diﬀerence between the problem under study and this problem is that in the considered problem
the center of one of the clusters is ﬁxed in the origin (without loss of generality). It is obvious, that the
considered problem and the Weighted variance-based 2-clustering problem are not equivalent and so both of
them need an individual study. The importance of the problem for applications motivates to continue researches,
e.g. importance for geometric problems, approximation problems, statistical problems of joint evaluations and

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of
the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org


                                                            304
testing hypotheses by nonuniform samples, problems of Data clustering, Data mining, Machine learning, Big
data, applied problems in technical and medical diagnostics, etc.

2    Cardinality-weighted Variance-based 2-clustering with Given Center
Problem 1. Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer M . Find a partition of Y
into two non-empty clusters C and Y \ C such that
                                              ∑                              ∑
                               F1 (C) = |C|         ∥y − y(C)∥2 + |Y \ C|           ∥y∥2 −→ min ,                        (1)
                                              y∈C                           y∈Y\C

                   ∑
              1
where y(C) = |C|         y is the geometric center (centroid) of C and such that |C| = M .
                   y∈C
   Problem 1 is strongly NP-hard [Kel’manov & Pyatkin, 2015, Kel’manov & Pyatkin, 2016]. Therefore, by
[Garey & Johnson, 1979], there are neither exact polynomial-time nor exact pseudopolynomial-time algorithms
for this problem, unless P=NP.
   Despite of this, a pseudopolynomial algorithm for the special case of Problem 1 exists. It proposed in
[Kel’manov & Motkova, 2016a] and ﬁnds an optimal solution in the case of integer components of the points
in the input set and ﬁxed space dimension.
   Algorithm A1 (exact pseudopolynomial algorithm).
   Input: A set Y and some positive integer M ≤ N .
   Step 1. Construct G — a multidimensional cubic uniform in each coordinate grid of size 2D with the distance
 1
M between the nodes and the center at the origin, and D is the maximal modulus of the coordinates of input
points.
   Step 2. For each node x ∈ G, calculate

                                      g x (y) = (2M − N )∥y∥2 − 2M ⟨y, x⟩ , y ∈ Y;                                       (2)
ﬁnd the subset B x which consists of M points of the set Y, at which the function g x (y) has the smallest values.
Calculate F1 (Bx ) by (1).
  Step 3. In the family F1 (B x ), x ∈ G constructed at Step 2, ﬁnd the node xA1 = arg min F1 (Bx ) and the
                                                                                                          x∈G
subset B xA1 . Put CA1 = B xA1 .
   Output: The set CA1 .
   Theorem 1. Assume that the elements of Y have integer values from the interval [−D, D]. Then algorithm
A1 ﬁnds an optimal solution of Problem 1 in O(qN (2M D + 1)q ) time.
   In the case of ﬁxed space dimension q the algorithm is pseudopolynomial.
   In [Kel’manov & Pyatkin, 2015, Kel’manov & Pyatkin, 2016] it is also shown that for Problems 1
there does not exist any fully polynomial time approximation scheme (FPTAS), unless P=NP. In
[Kel’manov & Motkova, 2016b] was proposed such approximation scheme for a special case of the Problem 1
when the space dimension q is ﬁxed.
   Algorithm A2 (approximation scheme).
   Input: a set Y and numbers M and ε.
   For each point y ∈ Y Steps 1–6 are executed.
   Step 1. Compute the values g y (z), z ∈ Y, using formula (2); ﬁnd a subset By ⊆ Y with M smallest values
g (z), compute F (B y ) using formula (1).
 y

   Step 2. If F (B y ) = 0, then put CA3 = B y ; exit.                 √
                                     1
                                       √                             1   2ε
   Step 3. Compute H = H(y) = M          F1 (B y ) and h = h(y, ε) = M          y
                                                                         q F1 (B ).
    Step 4. Construct the lattice G(y, h, H + h/2) using formula

           G(y, h, H + h/2) = {d ∈ Rq | d = y + h · (i1 , . . . , iq ), ik ∈ Z, |hik | ≤ H + h/2, k ∈ {1, . . . , q}}.

   Step 5. For each node x of the lattice G(y, h, H + h/2) compute the values g x (y), y ∈ Y, using formula (2)
and ﬁnd a subset Bx ⊆ Y with M smallest values g x (y). Compute F1 (B x ) using formula (1), remember this
value and the set B x .
   Step 6. If F (B x ) = 0, then put CA2 = B x ; exit.


                                                               305
  Step 7. In the family {B x | x ∈ G(y, h, H + h/2), y ∈ Y} of candidate sets that have been constructed in Steps
1–6, choose as a solution CA2 the set B x for which F1 (B x ) is minimal.
  Output: the set CA2 .
  Theorem
  (     (√ 2. For  )q )any ﬁxed ε > 0 algorithm A2 ﬁnds a (1 + ε)-approximate solution of Problem 1 in
O qN 2       2q
              ε +2      time.
   Corollary 1. In the case when the dimension q of space is bounded by a constant value, this algorithm is an
FPTAS.
   Later in [Kel’manov et al., 2017] was proposed the similar approximation scheme with the same time com-
plexity for the generalization of Problem 1 in which the weight factors are given as input (positive real numbers).
Also there was(presented an improved algorithm     that allows to ﬁnd (1 + ε)-approximate solution of generalized
                 √ 2 ( πe )q/2 (√ 2      )q )
problem in O       qN     2         ε +2      time. In the case when the dimension q of space is bounded by a
constant value, this algorithm is also an FPTAS. In addition, improved algorithm remains polynomial even when
the dimension q of the space is bounded by the value C log N , where C is the positive constant. It is clear that
                                                                         1
in this case algorithm implements a PTAS with time complexity N O(log ε ) .
   The last proposed algorithm for Problem 1 until this moment is a 2-approximation algorithm. It was proposed
in [Kel’manov & Motkova, 2017] and allows us to ﬁnd an approximate solution in the polynomial time.
   Algorithm A3 (2-approximation algorithm).
   Input: N -elements set Y ⊂ Rq , natural number M ≤ N.
   For each point y ∈ Y Steps 1–2 are executed.
   Step 1. Compute the values g y (z), z ∈ Y, using formula (2); ﬁnd an M -elements subset B y ⊆ Y with the
smallest values g y (z), compute F1 (By ) using formula (1).
   Step 2. If F1 (By ) = 0, then put CA2 = B y ; exit.
   Step 3. In the family {By |, y ∈ Y} of candidate sets that have been constructed in Steps 1–2, choose as a
solution CA3 the set B x , for which F1 (Bx ) is minimal.
   Output: the set CA3 .
   Two examples of an input set (of 1000 points) and admissible solutions (i.e. 300-elements subset By ) found
by Algorithm A3 at step 1 are presented at Fig.1 (a) and (b).


                     (a)                                                                  (b)

                                                      Fig. 1.


  Let us substuntiate this algorithm below.


                                                       306
    The following two lemmas are well known.         (see, for example, [Kel’manov & Romanchenko, 2012,
Kel’manov & Romanchenko, 2014]).                                               ∑
    Lemma 1. For an arbitrary point x ∈ Rq , a ﬁnite set Z ⊂ Rq and z = |Z|1
                                                                                 z∈Z z (z is the centroid of Z),
it is true that                ∑               ∑
                                  ∥z − x∥2 =       ∥z − z∥2 + |Z| · ∥x − z∥2 .
                                           z∈Z                        z∈Z

   Lemma 2. For a ﬁnite set Z ⊂ R , if a point u ∈ Rq is closer (in terms of distance) to the centroid z of Z
                                                  q

than any point in Z, then            ∑                ∑
                                         ∥z − u∥2 ≤ 2     ∥z − z∥2 .
                                                          z∈Z                      z∈Z

  Lemma 3. Let                                    ∑                                    ∑
                                  S(C, x) = |C|            ∥y − x∥2 + |Y \ C|                ∥y∥2 , C ⊆ Y, x ∈ Rq .
                                                  y∈C                                y∈Y\C

Then the next statements are true:
           ∑ nonempty ﬁxed set C ⊆ Y the minimum of the function S(C, x) over x ∈ R is reached at the point
                                                                                   q
  (1) for any
        1
y(C) = |C|    y;
           y∈C
   (2) if |C| = M = const, then for any ﬁxed point x ∈ Rq the minimum of function S(C, x) over C ⊆ Y is reached
at the subset Bx that consists of M points of the set Y, at which the function (2) has the smallest values.
   Proof. The ﬁrst statement follows from Lemma 2 and the deﬁnition of the functions S and F . Since |Y| = N
and |C| = M , the second statement follows from the next chain of equalities:

                   ∑                                      ∑                   ∑                  ∑
  S(C, x) = M            ∥y − x∥2 + (N − M )                     ∥y∥2 = M            ∥y∥2 − 2M         ⟨y, x⟩ + M 2 ∥x∥2 +
                   y∈C                                y∈Y\C                   y∈C                y∈C
                            ∑             ∑             ∑{                         }                      ∑
           (N − M ) (             ∥y∥ −
                                     2
                                                ∥y∥ ) =
                                                      2
                                                          (2M − N )∥y∥2 − 2M ⟨y, x⟩ + M 2 ∥x∥2 + (N − M )   ∥y∥2 .
                            y∈Y           y∈C                   y∈C                                                                   y∈Y

It remains to note that in the last two equalities the last two addends do not depend on C.
   Lemma is proved.                                                                        (    )
   Theorem 3. An Algorithm A3 ﬁnds a 2-approximate solution of Problem 1 in time O qN 2 .
   Proof. Let C ∗ be an optimal subset and t = arg min∗ ||y − y(C ∗ )||2 be a point from a subset C ∗ that is the
                                                                             y∈C
closest to its centroid.
   Step-by-step, the algorithm goes through every points y of the set Y. Since t ∈ Y, a permissible solution also
will be formed for the point t: the subset B t , deﬁned by Lemma 3 (if x = t).
   Besides, a value F (B t ) of objective function for the Problem 1 will be calculated. For this value we have this
inequalities from deﬁnitions of the functions F and S and Lemma 3 and a deﬁnition of the point t:

                                                F (B t ) = S y(B ) (B t ) ≤ S t (Bt ) ≤ S t (C ∗ ).
                                                                       t
                                                                                                                                                (3)

  Applying Lemma 2 to the set Z = C ∗ and the point u = t, we have
                                  ∑                   ∑
                                       ||y − t||2 ≤ 2   ||y − y(C ∗ )||2 .
                                                  y∈C ∗                        y∈C ∗

Using this inequality and deﬁnition (2), we ﬁnd an estimation for a right part of (3)
                   ∑                                       ∑
  S t (C ∗ ) = M           ||y − t||2 + (N − M )                    ||y||2
                   y∈C ∗                                  y∈Y\C ∗
                                                                              ∑                                     ∑
                                                                      ≤ 2M           ||y − y(C ∗ )||2 + (N − M )             ||y||2 ≤ 2F (C ∗ ). (4)
                                                                             y∈C ∗                                 y∈Y\C ∗

Combining (3) and (4), we have
                                                                    F (Bt ) ≤ 2F (C ∗ ).                                                        (5)


                                                                             307
    From Step 3 we know that the algorithm ﬁnds a solution of Problem 1 in the next form:

                                               CA = arg min F (B y ).                                         (6)
                                                         y∈Y


    Since Bt ∈ {B y |y ∈ Y}, from (6) we have an inequality

                                                 F (CA ) ≤ F (B t ).                                          (7)

Finally, (5) and (7) implies an estimate
                                                F (CA ) ≤ 2F (C ∗ ).                                          (8)
   Let us consider the case when at Step 2 of the algorithm the condition F (B y ) = 0 is executed for some
input point y ∈ Y. For every subset C ⊆ Y inequality F (C) ≥ 0 is correct, so the subset CA = By ⊆ Y is an
optimal solution of Problem 1. Inequality (8) is correct for this solution too. It means that a subset CA is a
2-approximation solution of Problem 1.
   Let us estimate the time complexity of the algorithm. At Step 1 we need no more than O(qN ) operations to
calculate values g y (z). Searching of M the smallest elements in the set of N elements requires O(N ) operations
(for example, using an algorithm of ﬁnding n-th smallest value in an unordered array [Wirth, 1976]). Step 2
need constant time. So for any point y ∈ Y total execution time of Steps 1 and 2 is O(qN ).
   As Steps 1 and 2 are executed N times, total time complexity of this steps is O(qN 2 ). Time complexity of
Step 3 is estimated by value O(N ). So, the time complexity of the algorithm is O(qN 2 ). Theorem 3 is proved.


   Two examples of an input set (of 1000 points) and 2-approximate solutions found by Algorithm A3 are
presented at Fig.2 (a) (i.e. 400-elements subset CA ) and (b) (i.e. 600-elements subset CA ).


                      (a)                                                               (b)

                                                      Fig. 2.


3    Conclusion
In this work, we presented a short survey on the results for the problem of 2-partitioning of points in Euclidean
space into two clusters. Also we presented the new result: a 2-approximation algorithm.
   It seems important to continue studying and the issue of great interest is the substantiation of randomized
algorithms for this problem with linear or sub-linear time complexity.


                                                        308
Acknowledgements
The research for algorithms A1 and A2 was supported by the Russian Foundation for Basic Research, Projects
16-31-00186, 16-07-00168. Research for algorithm A3 was supported by the Russian Science Foundation, Project
16-11-10041.

References
[Kel’manov & Pyatkin, 2015] Kel’manov, A. V., & Pyatkin, A. V. (2015). NP-Hardness of Some Quadratic Eu-
        clidean 2-Clustering Problems. Dokl.Akad.Nauk, 464(5), 535–538.

[Kel’manov & Pyatkin, 2016] Kel’manov, A. V., & Pyatkin, A. V. (2016). On the Complexity of Some Quadratic
        Euclidean 2-Clustering Problems. Comput. Math. Math. Phys., 56(3), 491–497.

[Garey & Johnson, 1979] Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A Guide to the
        Theory of NP-Completeness. San Francisco: Freeman.

[Kel’manov & Motkova, 2016a] Kel’manov, A. V., & Motkova, A. V. (2016). Exact Pseudopolynomial Algorithms
        for a Balanced 2-Clustering Problem. J. of Appl. and Ind. Math., 10(3), 349–355.

[Kel’manov & Motkova, 2016b] Kel’manov, A. V., & Motkova, A. V. (2016). A Fully Polynomial-Time Approxi-
        mation Scheme for a Special Case of a Balanced 2-Clustering Problem. LNCS, 9869, 182–192.

[Kel’manov et al., 2017] Kel’manov, A. V., & Motkova, A. V., & Shenmaier, V. V. (2017). Approximation
        Schemes for Some Quadratic Problems of Weighted 2-Partitioning a Set of Points. (in Russian) Tr.
        Inst. Mat. Mekh., 23(3), 159–170.
[Kel’manov & Motkova, 2017] Kel’manov, A. V., & Motkova, A. V. (2017). An Approximation Polynomial-Time
        Algorithm for a Weighted 2-Clustering Problem with Restriction on Clusters Cardinalities. (in Russian)
        Comput. Math. Math. Phys. (accepted)

[Kel’manov & Romanchenko, 2012] Kel’manov, A. V., & Romanchenko, S. M. (2012). An Approximation Algo-
        rithm for Solving a Problem of Search for a Vector Subset. J. Appl. Ind. Math., 6(1), 90–96.

[Kel’manov & Romanchenko, 2014] Kel’manov, A. V., & Romanchenko, S. M. (2014). An FPTAS for a Vector
        Subset Search Problem. J. Appl. Indust. Math., 8(3), 329–336.
[Wirth, 1976] Wirth, N. (1976). Algorithms + Data Structures = Programs. New Jersey: Prentice Hall.


                                                     309