On Some Finite Set Clustering Problems
                          in Euclidean Space

                   Alexander Kel’manov                                     Artem Pyatkin
              Sobolev Institute of Mathematics                    Sobolev Institute of Mathematics
                 Acad. Koptyug avenue, 4,                            Acad. Koptyug avenue, 4,
                 630090 Novosibirsk, Russia                          630090 Novosibirsk, Russia
                Novosibirsk State University                        Novosibirsk State University
                      Pirogova str. 1,                                    Pirogova str. 1,
                630090 Novosibirsk, Russia.                         630090 Novosibirsk, Russia.
                     kelm@math.nsc.ru                                   artem@math.nsc.ru


                                                        Abstract

                       Problems of partitioning a ﬁnite set of Euclidean points (vectors) into
                       clusters are considered. The criterion is minimizing the sum over all
                       clusters of: (1) normalized by the cardinality squared norms of the sum
                       of cluster elements, (2) squared norms of the sum of cluster elements,
                       (3) norms of the sum of cluster elements. It is proved that all these
                       problems are strongly NP-hard if the number of clusters is a part of the
                       input, and NP-hard in the ordinary sense if the number of clusters is
                       not a part of the input (is ﬁxed). Moreover, the problems are NP-hard
                       even in the case of dimension 1 (on a line).


1    Introduction
The subject of this study includes several related discrete optimization problems of partitioning a ﬁnite set of
Euclidean points into a family of clusters. Our goal is to analyze the computational complexity of these problems.
This study was motivated by the lack of published results concerning the complexity status of these problems and
by their importance, speciﬁcally, for combinatorial geometry, statistics, physics, mathematical problems of clus-
tering, and interdisciplinary problems regarding data mining and big data. This paper supplements and develops
the results obtained in [Kel’manov & Pyatkin, 2016], [Eremeev et al., 2016], [Kel’manov & Pyatkin, 2009].


2    Problems Formulation
Throughout this paper, R denotes the set of real numbers and ∥·∥ is the Euclidean norm. The following problems
are considered.
   Problem 1 (Minimum sum of normalized squares of norms clustering).
   Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer J > 1.

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of
the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org


                                                            310
  Find a partition C1 , . . . , CJ of Y into nonempty clusters (subsets) such that

                                            ∑J
                                                 1 ∑ 2
                                                      y −→ min .
                                            j=1
                                                |Cj |
                                                     y∈Cj


  Problem 2 (Minimum sum of squared norms clustering).
  Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer J > 1.
  Find a partition C1 , . . . , CJ of Y into nonempty clusters such that

                                              ∑
                                              J ∑             2
                                                          y       −→ min .
                                              j=1 y∈Cj


  Problem 3 (Minimum sum-of-norms clustering).
  Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer J > 1.
  Find a partition C1 , . . . , CJ of Y into nonempty clusters such that

                                               ∑
                                               J ∑
                                                          y −→ min .
                                               j=1 y∈Cj

   These partitioning problems can be interpreted as optimal summing ones. On the other hand, they have an
obvious geometric interpretation, namely, in a ﬁnite set of Euclidean points, we search for a family of disjoint
geometric structures (clusters) with an extremal property described by the corresponding objective function.
   The example of an input set of 1000 points from two clusters on a plane is presented at Fig. 1. The points
of both clusters are scattered around the origin (the points of diﬀerent clusters are colored diﬀerently). One
needs to ﬁnd a partition of this set into two clusters in accordance with the objective function of one of above
formulated problems.
3[mm]


                              Figure 1: Example of the input set with two clusters

   Another example of an input set of 500 points on a plane is presented at Fig. 2. In this example, the points
from three clusters are scattered around the origin.
   In [Kel’manov & Pyatkin, 2009] the similar maximization problems were considered. In particular, it was
proved that Problem 1 on maximum is strongly NP-hard. Since changing the optimization direction can vary
the complexity quite unpredictably, ﬁnding out the complexity status of Problems 1–3 looks interesting.
   On the other hand, Problems 1–3 have some applications in Data mining, statistics and natural sciences. But
ﬁrst let us remind some relative problems that were studied in [Eremeev et al., 2016].


                                                          311
                              Figure 2: Example of the input set with three clusters

  Problem 4 (Subset with the minimum normalized length of vectors sum).
  Given a set Y = {y1 , . . . , yN } of points from Rq .
  Find a nonempty subset C ⊆ Y such that
                                                1 ∑ 2
                                                   y −→ min .
                                               |C|
                                                   y∈C

  Problem 5 (Subset with shortest vectors sum, arbitrary cardinality).
  Given a set Y = {y1 , . . . , yN } of points from Rq .
  Find a nonempty subset C ⊆ Y such that
                                                    ∑
                                                         y −→ min .
                                                  y∈C

  Problem 6 (Subset with shortest vectors sum, given cardinality).
  Given a set Y = {y1 , . . . , yN } of points from Rq and a positive integer M .
  Find a subset C ⊆ Y of cardinality M such that
                                                    ∑
                                                        y −→ min .
                                                  y∈C

   It was proved in [Eremeev et al., 2016] that if the dimension of the space is a part of the input then Problems 4–
6 are NP-hard in the strong sense, and if the dimension of the space is ﬁxed then they are NP-hard in the
ordinary sense even on the plane (q = 2). It was also shown there that there are no approximation algorithms
with guaranteed performance for them unless P=NP. However, if the coordinates of the input points are integer
and the dimension of the space is bounded by a constant then Problems 4–6 can be solved in a pseudopolynomial
time.
   From the natural sciences point of view, the objective function of Problems 4–6 can be treated as searching
for a subset of compensated (balanced) forces. Then Problems 1–3 can be interpreted as physical problems of
partitioning a set of forces into subsets of collectively balanced forces. If the elements of the subset are treated
as the sequence of coordinates of the jth Brownian particle that moves in space starting at the origin, then is
the length of the total displacement of this particle from the starting position under the action of chaotic jolts.
Therefore, Problems 13 can be treated as ones of partitioning a set of points into Brownian clusters with the
minimum sum of mean squared displacements (Problem 1), with the minimum sum of squared displacements
(Problem 2), or with the minimum sum of displacements (Problem 3).
   On the other hand, Problems 1–3 have roots in interdisciplinary problems related to data mining and big data
(see, e.g., [Aggarwal, 2015, Bishop, 2006, Hastie et al., 2001]). In these problems, a major task is to partition a


                                                         312
set into clusters consisting of objects that are similar or close according to some criterion, that can be of various
types. The partitioning criteria in Problems 1–3 have not been investigated thus far.
   Finally, Problems 1–3 have applications in statistics. Indeed, the input set can be treated as an inhomogeneous
sample of several distributions where the correspondence of sample elements to distributions is not given. Is it
true that these unknown distributions have diﬀerent ﬁnite variances and identical (zero) expectations? Finding
the optimal solution of Problems 1–3 could give an answer to this question. Indeed, if the hypothesis is true then
by Central limit theorem for each of the optimal samples (i. e. clusters C1∗ , . . . , CJ∗ ) the following convergence in
distribution holds:                                   ∑
                                                1
                                                  ∗       y −−∗−−−→ Φ0,1 ,
                                             σj |Cj |   ∗
                                                             |Cj |→∞
                                                     y∈Cj

where σj is the dispersion of jth distribution and Φ0,1 is Gaussian distribution with parameters (0, 1). Therefore,
the standard statistical hypothesis testing methods can be applied for each of the optimal subsets.
   Thus, the search for an optimal solution of Problems 1–3 in polynomial time is an important task from both
theoretical and application points of view. In this paper, we show that Problems 1–3 are strongly NP-hard if the
number of clusters is a part of the input and are NP-hard in the ordinary sense if the number of clusters is not
a part of the input. Moreover, all these problems are NP-hard even for q = 1, i.e., on a line. This result means
that all the applied problems mentioned above are hard to solve.

3     Main Results
The ﬁrst result of this paper consists in the following
   Theorem 1. Problems 1–3 are strongly NP-hard even for the dimension q = 1.
   Proof. First let us present the Problems 1–3 for q = 1 in the properties veriﬁcation form.
   Problems 1–3.
   Given the set Y = {y1 , . . . , yN } of reals (“vectors” of dimension 1), positive integer J > 1 and positive K.
   Question: Is there a prstition of the set Y into nonempty subsets (clusters) C1 , . . . , CJ such that the corre-
sponding objective function is at most K?
   To prove the theorem we show that the following well-known strongly NP-hard problem
[Garey & Johnson, 1979] is polynomially reduced to Problems 1–3.
   3-Partition problem.
   Given a positive integer B and a set A of 3n positive integers from the interval (B/4, B/2) such that their
sum is equal to nB.
   Question: can it be partitioned into n subsets such that the sum of the numbers in each of them is equal to
B?
   Then we use the following reduction. Given an arbitrary instance of 3-Partition construct the following joint
instance of the Problems 1–3.
   Put K = 0, J = n and Y = A ∪ B, where

                                  B = {nB + iα, −(nB + B + iα) | i = 1, 2, . . . , n}

and
                                                    α = 1/(n + 1).
   Assume that in the instance of 3-Partition the set A can be partitioned into n subsets such that the sum of
the elements in each subset is B. Then adding to the ith of these subsets (i = 1, 2, . . . , n) the numbers nB + iα
and −(nB + B + iα) from the set B results in the partition of the set Y in Problems 1–3 into J = n clusters
with all three objective functions equal to 0.
   Now suppose that in Problems 1—3 the set Y can be partitioned into J = n clusters such that the corre-
sponding objective function is 0. Then, clearly, the sum of the elements inside each cluster is 0. Hence, each
cluster contains at least one negative number. Since there are n negative numbers in Y, we can assume ith
cluster sontains the number −(nB + B + iα).
   Next, since
                              (nB + α) + (nB + 2α) = 2nB + 3α > (n + 1)B + nα,
each cluster contains at most one positive number from the set B. Since the addend −iα cannot be turned to
zero by the addends from the set A, the ith cluster must contain exactly one positive number from B, namely,


                                                            313
(nB + iα). But then the sum of other elements in the cluster (note that all of them are fromA) is equal to B,
i. e. they induce a partition of the set A into n subsets satisfying the requirements of 3-Partition.
    Theorem 1 is proved.
    Now denote by Problems 1(J)–3(J) the variants of the Problems 1–3 where the number of clusters J is not a
part of input. Here is their statement in the properties veriﬁcation form.
    Problems 1(J)–3(J).
    Given: Set Y = {y1 , . . . , yN } of reals and positive K.
    Question: Is there a partition of the set Y into nonempty clusters C1 , . . . , CJ such that the corresponding
objective function is at most K?
    The second, main result of our paper is the following
    Theorem 2. Problems 1(J)–3(J) are NP-hard even for the dimension q = 1.
    Proof. We use the reduction of the classic [Garey & Johnson, 1979] NP-hard
    Partition problem.
    Given a set A of positive integers whose sum is equal to 2W .
    Question: Can it be partitioned into two subsets with the sum of the numbers in each equal to W ?
    Given an arbitrary instance of Partition problem construct the following general instance of Problems 1(J)–
3(J). Put K = 0 and Y = A ∪ B, where the set

                             B = {−W, −3W, 2W } ∪ {−iW, iW | i = 4, 5 . . . , J + 1}.

   Assume the Partition instance contains two nonintersecting subsets such that the sum of elements in each of
them is equal to W . Consider the following partition of the set Y in Problems 1(J)–3(J). As ﬁrst two clusters
consider the optimal subsets from Partition, supplemented by the number −W and a couple of numbers 2W and
−3W , respectively. The remaining elements of the set Y are partitioned into J − 2 clusters of type {−iW, iW },
where i = 4, 5 . . . , J + 1. Clearly, the constructed J clusters is a solution of Problems 1(J)–3(J) providing value
0 of the objective function.
   Now let the Problems 1(J)–3(J) allow a solution with the corresponding objective function equal to 0. Note
that each of the objective functions of the Problems 1(J)–3(J) is 0 if and only if the sum of the elements in each
cluster is 0. Since Y contains J negative numbers, each of J clusters has exactly one negative number. Then it
is easy to show by induction on j from J + 1 down to 4 that the number jW must be in the same cluster as
−jW . Then clearly the number 2W shares a cluster with the number −3W , and the elements of A lying in the
last two clusters sums to W , i. e. they induce the desired partition of the set A.
   Theorem 2 is proved.
   Note that the simpler proofs of Theorems 1 and 2 for the multisets versions of Problems 1–3 were earlier
presented in [Kel’manov & Pyatkin, 2016].

4   Conclusion
The obtained “negative” results show that in spite of the simplicity of the statement of the considered problems,
there are no exact or even approximation polynomial algorithms unless P = N P . However, obtaining “posi-
tive” algorithmic results looks promising for special cases of these problems that contain additional restrictions
excluding the zero value of the objective function.

Acknowledgements
This work was supported by Russian Science Foundation, Project 16-11-10041.

References
[Aggarwal, 2015] Aggarwal C. C. (2015). Data Mining: The Textbook. (2015). Springer International Publishing

[Bishop, 2006] Bishop M. C. (2006). Pattern Recognition and Machine Learning. New York: Springer Sci-
         ence+Business Media, LLC.

[Garey & Johnson, 1979] Garey M. R., Johnson D. S. (1979). Computers and Intractability: A Guide to the The-
        ory of NP-Completeness. San Francisco: Freeman.


                                                        314
[Hastie et al., 2001] Hastie T., Tibshirani R., Friedman J. (2001). The Elements of Statistical Learning: Data
          Mining, Inference, and Prediction. New York: Springer-Verlag.
[Eremeev et al., 2016] Eremeev A.V., Kel’manov A.V., Pyatkin A.V. (2016). On the Complexity of
        Some Euclidean Optimal Summing Problems. Doklady Mathematics, 93(3), 286-288. doi:
        0.1134/S1064562416030157

[Kel’manov & Pyatkin, 2009] Kel’manov A. V., Pyatkin A. V. (2009). Complexity of Certain Problems of Search-
        ing for Subsets of Vectors and Cluster Analysis Comput. Math. Math. Phys. 49(11), 1966-1971. doi:
        10.1134/S0965542509110128
[Kel’manov & Pyatkin, 2016] Kel’manov A.V., Pyatkin A.V. (2016). On the Complexity of Some Euclidean
        Problems of Partitioning a Finite Set of Points. Doklady Mathematics, 94(3), 635-638. doi:
        10.1134/S1064562416060089


                                                     315