Attainable Best Guarantee
         for the Accuracy of k-medians Clustering in [0, 1]

                                             Michael Khachay
                            Krasovsky Institute of Mathematics and Mechanics,
                               Ural Federal University, Ekaterinburg, Russia
                             Omsk State Technical University, Omsk, Russia
                                         mkhachay@imm.uran.ru

                                             Vasiliy Pankratov
                            Krasovsky Institute of Mathematics and Mechanics,
                                         pankratov.vs@gmail.com

                                             Daniel Khachay
                            Krasovsky Institute of Mathematics and Mechanics,
                               Ural Federal University, Ekaterinburg, Russia
                                            dmx@imm.uran.ru


                                                              Abstract
                       In this paper, one-dimensional k-medians clustering problem is con-
                       sidered in the context of zero-sum game between players choosing a
                       sample and partitioning it into clusters, respectively. For any sample
                       size n and k > 1, an attainable guaranteed value of the clustering accu-
                       racy 0.5n/(2k − 1) (the low value of an appropriate game) is provided
                       for samples taken from the segment [0, 1].

1    Introduction
In data analysis, k-medians clustering problem is regarded as one of the famous center-based metric clustering
problems, whose instance can be deﬁned as follows. For a given number k ≥ 1 and a ﬁnite sample ξ = (x1 , . . . , xn )
taken from a metric space (X, ρ), it is required to ﬁnd a partition of Nn = {1, . . . , n} onto k clusters C1 , . . . , Ck
and, for any j-th cluster, to point out an appropriate center cj such that

                             ∑
                             k ∑                        ∑
                                                        n
                                        ρ(xi , cj ) =         min{ρ(xi , c1 ), . . . , ρ(xi , ck )} → min .           (1)
                             j=1 i∈Cj                   i=1
                                                                       {∑                       }
Equation (1) evidently implies that, for any j, the point cj ∈ Arg min    i∈Cj ρ(xi , c) : c ∈ X , i.e. cj is a median
of the subsample ξj = (xi : i ∈ Cj ).
   As a combinatorial optimization problem, k-medians is shown1 to be intractable [Guruswami and Indyk, 2003]
even for the Euclidean metric and has no PTAS, unless P = N P . For d-dimensional Euclidean spaces there

Copyright ⃝
          c by the paper’s authors. Copying permitted for private and academic purposes.
In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of
the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org
   1 If k is a part of an instance.


                                                                   322
are known numerous approximation results. For instance, in [Kumar et al., 2010], for any ﬁxed k, randomized
                                        O(1)
LTAS with time complexity of O(2(k/ε)        · dn) is proposed. On the basis of the famous coresets technique,
in [Har-Peled and Mazumdar, 2004], RPTAS with polynomially depending on the number of clusters k time
complexity bound O(n + ρ(k log n)O(1) ), where ρ = exp(O((1 − log ε)/ε)d−1 ) is proposed. For d = 1, k-medians
problem is polynomially (and very eﬃciently) solvable. To date, the most eﬃcient exact algorithm with time
complexity O(n log n + kn) is proposed in [Grønlund et al., 2017].
   Among others, the setting, where it is required to obtain a guaranteed accuracy of clustering for a ﬁxed
number of clusters k and an arbitrary sample, is valuable ([Ben-David, 2015, Khachai and Neznakhina, 2017])
for applications in combinatorial optimization and data analysis. In this paper, we study such a setting for the
1d-case of the k-medians clustering problem.

2     Problem Statement and the Main Result
We consider the following two-player zero-sum game induced by k-medians clustering. There are two players
placing points in the unit segment of the real line. Strategies of the ﬁrst player are samples ξ = (x1 , . . . , xn ),
xi ∈ [0, 1] of some given
                     ∑n size n. Strategies of the second one are k-tuples σ = (c1 , . . . , ck ), ci ∈ [0, 1]. The payoﬀ
function F (ξ, σ) = i=1 min{|xi − c1 |, . . . , |xi − ck |}. Goals of the ﬁrst and the second players are to ﬁnd the
lower
                                        v∗ (n, k) = sup          inf F (ξ, σ)
                                                                          k
                                                         ξ∈[0,1]n σ∈[0,1]
and the higher
                                          v ∗ (n, k) =     inf      sup F (ξ, σ)
                                                         σ∈[0,1]k ξ∈[0,1]n

values of the game, respectively.
    It is easy to verify that, for any k > 1 and n > 0, the game has no value, i.e. v∗ (n, k) < v ∗ (n, k). For many
reasons arising from applications in data analysis, combinatorial optimization, and computational geometry, it
is important to have an upper bound for v∗ (n), which means the guaranteed accuracy of k-medians clustering of
an appropriate n-points sample. Although, v ∗ (n, k) can obviously be taken as an upper bound, for large values
of n it is imprecise and should be replaced with more accurate one.
    In this paper, we propose an attainable upper bound B(n, k) for v∗ (n, k). Actually, to any n > 0, k > 1, and
ξ ∈ [0, 1]n , we show how to assign an appropriate k-tuple σξ = (c1 , . . . , ck ), i.e. how to construct a clustering
C1 , . . . , Ck with medians c1 , . . . , ck , such that
                                          inf      F (ξ, σ) ≤ F (ξ, σξ ) ≤ B(n, k).
                                        σ∈[0,1]k

Theorem.
 (i) For any k > 1, n > 0, and sample ξ = (x1 , . . . , xn ), xi ∈ [0, 1], i ∈ Nn , there exists the k-tuple σξ =
     (c1 , . . . , ck ), cj ∈ [0, 1], j ∈ Nk , such that
                                                                          n
                                                         F (ξ, σξ ) ≤           .                              (2)
                                                                      2(2k − 1)
(ii) For any k > 1, there is ñ = ñ(k) such that, for all n > ñ, bound (2) is attained at some sample ξ = ξ(k, n).
  Postponing the rigorous proof to the forthcoming paper, we restrict ourselves to some suggestive thoughts.
To put it simple, we consider the case of k = 2.

3     Proof Sketch for k = 2
We start with the following simple upper bound

3.1    Naı̈ve Upper Bound
It can be assumed that the second player always adheres to the following strategy. He splits the segment [0, 1]
onto two equal parts and put c1 and c2 at the centers of each part as it is shown in Fig. 1
   Obviously, in this case, for any x∑∈ [0, 1], min{|x − c1 |, |x − c2 |} ≤ 1/4. Therefore, regardless of the choice
                                           n
ξ = (x1 , . . . , xn ) of the ﬁrst player, i=1 min{|xi − c1 |, |xi − c2 |} ≤ n/4, i.e. B(n, 2) ≤ n/4. Since, to complete
the ﬁrst point of the proof (for the considered case k = 2), we need to show that B(n, 2) ≤ n/6, we need further
improvements.


                                                             323
                                                      Figure 1: Simple upper bound
3.2   Reducing to Linear Program
Hereinafter, without loss of generality, we assume that any sample ξ = (x1 , . . . , xn ) contains points xi in ascending
order. Moreover, we assume that any cluster C = {i1 , . . . , im } ⊂ Nn inherites this property, i.e. xi1 ≤ . . . ≤ xim .
Then, for the median c of the cluster C we have
                                        ⌊m/2⌋                                                ⌊m/2⌋
                  ∑
                  m                     ∑                           ∑
                                                                    m                         ∑                      ∑
                                                                                                                     m
                        |xil − c| =            (c − xil ) +                  (xil − c) = −              xil +               xil .              (3)
                  l=1                    l=1                  l=⌈m/2⌉+1                           l=1           l=⌈m/2⌉+1

Therefore, for a given sample ξ, Φ(ξ) = inf σ=(c1 ,c2 ) F (ξ, σ) depends on choice of partitions C1 ∪ C2 = Nn
ultimately and obeys the equation
              {                                               }
                ∑                ∑
  Φ(ξ) = min        |xi − c1 | +   |xi − c2 | : C1 ∪ C2 = Nn
                  i∈C1                  i∈C
                                   2
                                                                                                                                           
                               ⌊m∑
                                  1 /2⌋
                                                             ∑
                                                             m1                ⌊m2 /2⌋
                                                                                ∑                         ∑
                                                                                                          m2                               
                         = min −        xi +                           xi −              xi+m1 +                     xi+m1 : m1 + m2 = n       .
                                                                                                                                          
                                         i=1             i=⌈m1 /2⌉+1            i=1                    i=⌈m2 /2⌉+1

Thus, v∗ (n, 2) = supξ∈[0,1]n Φ(ξ) is an optimum value of linear program (4)

  v∗ (n, 2) = max u
              s.t.
                              ⌊m∑
                                1 /2⌋               ∑1
                                                    m                ⌊m∑
                                                                       2 /2⌋                      ∑2
                                                                                                  m
                          −             xi +                  xi −             xi+m1 +                     xi+m1 ≥ u,       (m1 + m2 = n),
                               i=1              i=⌈m1 /2⌉+1            i=1                 i=⌈m2 /2⌉+1

                          0 ≤ x1 ≤ . . . ≤ xn ≤ 1.
                                                                                                                      (4)
Further, guided by the symmetry argument, we can reduce the number of variables (and also, the number of
constraints) in problem (4) by half. Indeed, suppose, ξ ′ = (x′1 , . . . , x′n ) is an optimal solution of (4). Then, by
symmetry, ξ ′′ = (1 − x′n , . . . , 1 − x′1 ) is an optimal solution of (4) as well. Convexity of the optimal set2 of (4)
implies that ξ = (ξ ′ + ξ ′′ )/2, each whose entry is deﬁned by the formula xi = (1 + x′i − x′n+1−i )/2 is also an
optimal solution. Since xi + xn+1−i = 1, hereinafter, we reduce the number of variables to ⌊n/2⌋. Moreover, for
odd n, x⌈n/2⌉ = 1/2.
  To show that B(n, 2) ≤ n/6, we study all cases for (n mod 6).
Case n = 6t:
Consider the constraint of (4) deﬁned by m1 = 2t and m2 = 4t.
                               ∑
                               t               ∑
                                               2t           ∑
                                                            3t               ∑
                                                                             3t                   ∑
                                                                                                  2t
                           −         xi +           xi −            xi −            (1 − xi ) +      (1 − xi ) ≥ u,
                               i=1          i=t+1          i=2t+1          i=2t+1                 i=1
                                   ∑t
which is equivalent to u + 2            i=1 xi ≤ t. Since all xi ≥ 0, u ≤ t = n/6, and we are done.

Case n = 6t + 1:
Here, we consider two constraints of (4), deﬁned by m1 = 2t, m2 = 4t + 1 and m1 = 2t + 1, m2 = 4t, respectively.
They are
                       ∑t        ∑2t         ∑3t
                                                       1   ∑3t              ∑2t
                     −     xi +       xi −         xi − −       (1 − xi ) +     (1 − xi ) ≥ u
                       i=1      i=t+1       i=2t+1
                                                       2 i=2t+2             i=1
  2 The set of optimal solutions


                                                                        324
and
                         ∑
                         t            ∑
                                      2t+1              ∑
                                                        3t
                                                                        1   ∑
                                                                            3t               ∑2t
                     −         xi +           xi −              xi −      −      (1 − xi ) +     (1 − xi ) ≥ u.
                         i=1          i=t+2          i=2t+2
                                                                        2 i=2t+1             i=1
After the equivalent transformation, we obtain the subsystem
                                   
                                          ∑
                                           t
                                   
                                   u + 2     xi + x2t+1 ≤ t + 12
                                                        i=1
                                           
                                                 ∑
                                                  t
                                           u + 2   xi + xt+1 − 2x2t+1 ≤ t − 21 ,
                                                        i=1

which implies
                                          ∑
                                          t
                               3u + 6           xi + xt+1 ≤ 3t + 1/2 and u ≤ t + 1/6 = n/6.
                                          i=1

In case n = 6t + 2
we take constraints deﬁned by m1 = 2t + 1, m2 = 4t + 1 and m1 = 2t, m2 = 4t + 2:

                             ∑
                             t            ∑
                                          2t+1            ∑
                                                          3t+1               ∑
                                                                             3t+1                  ∑
                                                                                                   2t
                         −         xi +           xi −               xi −            (1 − xi ) +      (1 − xi ) ≥ u
                             i=1          i=t+2          i=2t+2             i=2t+2                 i=1

                             ∑
                             t            ∑
                                          2t              ∑
                                                          3t+1               ∑
                                                                             3t+1                  ∑
                                                                                                   2t+1
                         −         xi +           xi −               xi −            (1 − xi ) +          (1 − xi ) ≥ u.
                             i=1          i=t+1          i=2t+1             i=2t+2                 i=1

Transformed                                       
                                                        ∑
                                                         t
                                                  
                                                  u + 2   xi − x2t+1 ≤ t
                                                               i=1
                                                  
                                                        ∑
                                                         t
                                                  u + 2   xi + 2x2t+1 ≤ t + 1,
                                                               i=1
they imply
                                                  ∑
                                                  t
                                       3u + 6           xi ≤ 3t + 1 i.e. u ≤ t + 1/3 = n/6.
                                                  i=1

Case n = 6t + 3
is similar to the case n = 6t. Here, to obtain the desired bound, it is enough to consider the single constraint
deﬁned by m1 = 2t + 1 and m2 = 4t + 2
                         ∑
                         t            ∑
                                      2t+1
                                                      1 ∑∑
                                                        3t+1
                                                         3t+1              ∑
                                                                           2t+1
                     −     xi +       xi −        xi − −       (1 − xi ) +      (1 − xi ) ≥ u.                             (5)
                       i=1      i=t+2      i=2t+2
                                                      2 i=2t+2              i=1

Being transformed, (5) becomes
                                                              ∑
                                                              t
                                                   u+2               xi + xt+1 ≤ t + 1/2,
                                                               i=1
which implies u ≤ t + 1/2 = n/6.

In case n = 6t + 4
we convolve again two appropriate constraints deﬁned by m1 = 2t + 1, m2 = 4t + 3 and m1 = 2t + 2, m2 = 4t + 2
                             ∑
                             t            ∑
                                          2t+1            ∑
                                                          3t+2               ∑
                                                                             3t+2                  ∑
                                                                                                   2t+1
                         −         xi +           xi −               xi −            (1 − xi ) +          (1 − xi ) ≥ u
                             i=1          i=t+2          i=2t+2             i=2t+3                 i=1

                             ∑
                             t+1          ∑
                                          2t+2            ∑
                                                          3t+2               ∑
                                                                             3t+2                  ∑
                                                                                                   2t+1
                         −         xi +           xi −               xi −            (1 − xi ) +         (1 − xi ) ≥ u,
                             i=1          i=l+2          i=2t+3             i=2t+2                 i=1


                                                                        325
which, after the equivalent transformation give the subsystem
                                                  
                                                        ∑
                                                         t
                                                  
                                                  u + 2   xi + x2t+2 ≤ t + 1
                                                               i=1
                                                  
                                                        ∑
                                                         t+1
                                                  u + 2     xi − 2x2t+2 ≤ t
                                                               i=1

implying
                                           ∑
                                           t
                                  3u + 6         xi + 2xt+1 ≤ 3t + 2 i.e. u ≤ t + 2/3 = n/6.
                                           i=1


Finally, in case n = 6t + 5
transforming the constraints deﬁned by m1 = 2t + 2, m2 = 4t + 3 and m1 = 2t + 1, m2 = 4t + 4

                           ∑
                           t+1          ∑
                                        2t+2            ∑
                                                        3t+2
                                                                       1   ∑
                                                                           3t+2             ∑
                                                                                            2t+1
                       −         xi +           xi −            xi −     −      (1 − xi ) +      (1 − xi ) ≥ u
                           i=1          i=t+2          i=2t+3
                                                                       2 i=2t+3              i=1

                           ∑
                           t            ∑
                                        2t+1            ∑
                                                        3t+2
                                                                       1   ∑
                                                                           3t+2             ∑
                                                                                            2t+2
                       −         xi +           xi −            xi −     −      (1 − xi ) +      (1 − xi ) ≥ u
                           i=1          i=t+2          i=2t+2
                                                                       2 i=2t+3              i=1

we obtain the subsystem
                                           
                                                 ∑
                                                  t+1
                                           
                                           u + 2     xi − x2t+2 ≤ t + 12
                                                        i=1
                                           
                                                 ∑
                                                  t
                                           u + 2   xi + xt+1 + x2t+2 ≤ t + 32 ,
                                                        i=1

which, being convolved, gives us

                                          ∑
                                          t
                                 3u + 6         xi + 5xt+1 ≤ 3t + 5/2 =⇒ u ≤ t + 5/6 = n/6.
                                          i=1

   Thus, we completely proved point (i) of Theorem for the case of k = 2.

3.3   Attainability
Now, we show that for any n ≥ 12 inequality (2) is tight. Consider the following conﬁguration given by locations
p1 , . . . , p5


                                                       Figure 2: The configuration
   Place n = 4⌊ n4 ⌋ + { n4 } points at the locations p1 , . . . , p5 with multiplicities presented at Fig. 3


                                                       Figure 3: Placing the points
   Since n ≥ 12, the multiplicities of points located at p1 , p2 , p4 , and p5 are at least 3 and at most 3 points
are located at p3 . By the symmetry of the sample obtained, there are two best options to partition it into two
clusters C1 = {1, . . . , ⌊n/4⌋}, C2 = {⌊n/4⌋ + 1, . . . , n} and C1 = {1, . . . , 2⌊n/4⌋}, C2 = {2⌊n/4⌋ + 1, . . . , n} (see
Fig.4).


                                                                       326
                                          Figure 4: Two ways of possible clustering
  Let us calculate the cost F (ξ, σ) for each option. In the ﬁrst case
                                                         ∑
                                              F (ξ, σ) =    |xi − c2 |,
                                                                 i∈C2

where c2 = p4 (since n > 12). Therefore,
                              ⌊n⌋ 1       {n} 1        ⌊n⌋ 1       ⌊n⌋ 2         {n} 1              ⌊n⌋       {n}
                                                                                                4    4    +    4        n
                 F (ξ, σ) =           +            +           =             +              =                       =     .
                               4 3         4   6        4 3            4 3        4     6                 6             6
  Consider the second case. Here, again c2 = p4 . Therefore,
                                                   ⌊n⌋ 1         {n} 1          ⌊n⌋ 1       n
                                      F (ξ, σ) =             +              +           =     ,
                                                       4 3         4    6        4 3        6
i.e. Theorem is completely proved so as point (ii).

Acknowledgements
This research was supported by Russian Foundation for Basic Research, projects no. 16-07-00266 and no. 17-08-
01385.

References
[Ben-David, 2015] Ben-David, S. (2015). Computational feasibility of clustering under clusterability assumptions.
  CoRR, abs/1501.00437.
[Grønlund et al., 2017] Grønlund, A., Larsen, K. G., Mathiasen, A., and Nielsen, J. S. (2017). Fast exact k-
  means, k-medians and bregman divergence clustering in 1d. CoRR, abs/1701.07204.

[Guruswami and Indyk, 2003] Guruswami, V. and Indyk, P. (2003). Embeddings and non-approximability of
  geometric problems. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
  SODA ’03, pages 537–538, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.
[Har-Peled and Mazumdar, 2004] Har-Peled, S. and Mazumdar, S. (2004). On coresets for k-means and k-median
  clustering. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04,
  pages 291–300, New York, NY, USA. ACM.

[Khachai and Neznakhina, 2017] Khachai, M. and Neznakhina, E. (2017). Solvability of the Generalized Travel-
  ing Salesman Problem in the class of quasi- and pseudo-pyramidal tours (in Russian). Trudy Inst. Matematiki
  i Mechaniki UrO RAN, 23(3):280–291.
[Kumar et al., 2010] Kumar, A., Sabharwal, Y., and Sen, S. (2010). Linear-time approximation schemes for
  clustering problems in any dimensions. J. ACM, 57(2):5:1–5:32.


                                                                 327