Attainable Best Guarantee for the Accuracy of k-medians Clustering in [0, 1] Michael Khachay Krasovsky Institute of Mathematics and Mechanics, Ural Federal University, Ekaterinburg, Russia Omsk State Technical University, Omsk, Russia mkhachay@imm.uran.ru Vasiliy Pankratov Krasovsky Institute of Mathematics and Mechanics, pankratov.vs@gmail.com Daniel Khachay Krasovsky Institute of Mathematics and Mechanics, Ural Federal University, Ekaterinburg, Russia dmx@imm.uran.ru Abstract In this paper, one-dimensional k-medians clustering problem is con- sidered in the context of zero-sum game between players choosing a sample and partitioning it into clusters, respectively. For any sample size n and k > 1, an attainable guaranteed value of the clustering accu- racy 0.5n/(2k − 1) (the low value of an appropriate game) is provided for samples taken from the segment [0, 1]. 1 Introduction In data analysis, k-medians clustering problem is regarded as one of the famous center-based metric clustering problems, whose instance can be defined as follows. For a given number k ≥ 1 and a finite sample ξ = (x1 , . . . , xn ) taken from a metric space (X, ρ), it is required to find a partition of Nn = {1, . . . , n} onto k clusters C1 , . . . , Ck and, for any j-th cluster, to point out an appropriate center cj such that ∑ k ∑ ∑ n ρ(xi , cj ) = min{ρ(xi , c1 ), . . . , ρ(xi , ck )} → min . (1) j=1 i∈Cj i=1 {∑ } Equation (1) evidently implies that, for any j, the point cj ∈ Arg min i∈Cj ρ(xi , c) : c ∈ X , i.e. cj is a median of the subsample ξj = (xi : i ∈ Cj ). As a combinatorial optimization problem, k-medians is shown1 to be intractable [Guruswami and Indyk, 2003] even for the Euclidean metric and has no PTAS, unless P = N P . For d-dimensional Euclidean spaces there Copyright ⃝ c by the paper’s authors. Copying permitted for private and academic purposes. In: Yu. G. Evtushenko, M. Yu. Khachay, O. V. Khamisov, Yu. A. Kochetov, V.U. Malkova, M.A. Posypkin (eds.): Proceedings of the OPTIMA-2017 Conference, Petrovac, Montenegro, 02-Oct-2017, published at http://ceur-ws.org 1 If k is a part of an instance. 322 are known numerous approximation results. For instance, in [Kumar et al., 2010], for any fixed k, randomized O(1) LTAS with time complexity of O(2(k/ε) · dn) is proposed. On the basis of the famous coresets technique, in [Har-Peled and Mazumdar, 2004], RPTAS with polynomially depending on the number of clusters k time complexity bound O(n + ρ(k log n)O(1) ), where ρ = exp(O((1 − log ε)/ε)d−1 ) is proposed. For d = 1, k-medians problem is polynomially (and very efficiently) solvable. To date, the most efficient exact algorithm with time complexity O(n log n + kn) is proposed in [Grønlund et al., 2017]. Among others, the setting, where it is required to obtain a guaranteed accuracy of clustering for a fixed number of clusters k and an arbitrary sample, is valuable ([Ben-David, 2015, Khachai and Neznakhina, 2017]) for applications in combinatorial optimization and data analysis. In this paper, we study such a setting for the 1d-case of the k-medians clustering problem. 2 Problem Statement and the Main Result We consider the following two-player zero-sum game induced by k-medians clustering. There are two players placing points in the unit segment of the real line. Strategies of the first player are samples ξ = (x1 , . . . , xn ), xi ∈ [0, 1] of some given ∑n size n. Strategies of the second one are k-tuples σ = (c1 , . . . , ck ), ci ∈ [0, 1]. The payoff function F (ξ, σ) = i=1 min{|xi − c1 |, . . . , |xi − ck |}. Goals of the first and the second players are to find the lower v∗ (n, k) = sup inf F (ξ, σ) k ξ∈[0,1]n σ∈[0,1] and the higher v ∗ (n, k) = inf sup F (ξ, σ) σ∈[0,1]k ξ∈[0,1]n values of the game, respectively. It is easy to verify that, for any k > 1 and n > 0, the game has no value, i.e. v∗ (n, k) < v ∗ (n, k). For many reasons arising from applications in data analysis, combinatorial optimization, and computational geometry, it is important to have an upper bound for v∗ (n), which means the guaranteed accuracy of k-medians clustering of an appropriate n-points sample. Although, v ∗ (n, k) can obviously be taken as an upper bound, for large values of n it is imprecise and should be replaced with more accurate one. In this paper, we propose an attainable upper bound B(n, k) for v∗ (n, k). Actually, to any n > 0, k > 1, and ξ ∈ [0, 1]n , we show how to assign an appropriate k-tuple σξ = (c1 , . . . , ck ), i.e. how to construct a clustering C1 , . . . , Ck with medians c1 , . . . , ck , such that inf F (ξ, σ) ≤ F (ξ, σξ ) ≤ B(n, k). σ∈[0,1]k Theorem. (i) For any k > 1, n > 0, and sample ξ = (x1 , . . . , xn ), xi ∈ [0, 1], i ∈ Nn , there exists the k-tuple σξ = (c1 , . . . , ck ), cj ∈ [0, 1], j ∈ Nk , such that n F (ξ, σξ ) ≤ . (2) 2(2k − 1) (ii) For any k > 1, there is ñ = ñ(k) such that, for all n > ñ, bound (2) is attained at some sample ξ = ξ(k, n). Postponing the rigorous proof to the forthcoming paper, we restrict ourselves to some suggestive thoughts. To put it simple, we consider the case of k = 2. 3 Proof Sketch for k = 2 We start with the following simple upper bound 3.1 Naı̈ve Upper Bound It can be assumed that the second player always adheres to the following strategy. He splits the segment [0, 1] onto two equal parts and put c1 and c2 at the centers of each part as it is shown in Fig. 1 Obviously, in this case, for any x∑∈ [0, 1], min{|x − c1 |, |x − c2 |} ≤ 1/4. Therefore, regardless of the choice n ξ = (x1 , . . . , xn ) of the first player, i=1 min{|xi − c1 |, |xi − c2 |} ≤ n/4, i.e. B(n, 2) ≤ n/4. Since, to complete the first point of the proof (for the considered case k = 2), we need to show that B(n, 2) ≤ n/6, we need further improvements. 323 Figure 1: Simple upper bound 3.2 Reducing to Linear Program Hereinafter, without loss of generality, we assume that any sample ξ = (x1 , . . . , xn ) contains points xi in ascending order. Moreover, we assume that any cluster C = {i1 , . . . , im } ⊂ Nn inherites this property, i.e. xi1 ≤ . . . ≤ xim . Then, for the median c of the cluster C we have ⌊m/2⌋ ⌊m/2⌋ ∑ m ∑ ∑ m ∑ ∑ m |xil − c| = (c − xil ) + (xil − c) = − xil + xil . (3) l=1 l=1 l=⌈m/2⌉+1 l=1 l=⌈m/2⌉+1 Therefore, for a given sample ξ, Φ(ξ) = inf σ=(c1 ,c2 ) F (ξ, σ) depends on choice of partitions C1 ∪ C2 = Nn ultimately and obeys the equation { } ∑ ∑ Φ(ξ) = min |xi − c1 | + |xi − c2 | : C1 ∪ C2 = Nn i∈C1 i∈C  2   ⌊m∑ 1 /2⌋ ∑ m1 ⌊m2 /2⌋ ∑ ∑ m2  = min − xi + xi − xi+m1 + xi+m1 : m1 + m2 = n .   i=1 i=⌈m1 /2⌉+1 i=1 i=⌈m2 /2⌉+1 Thus, v∗ (n, 2) = supξ∈[0,1]n Φ(ξ) is an optimum value of linear program (4) v∗ (n, 2) = max u s.t. ⌊m∑ 1 /2⌋ ∑1 m ⌊m∑ 2 /2⌋ ∑2 m − xi + xi − xi+m1 + xi+m1 ≥ u, (m1 + m2 = n), i=1 i=⌈m1 /2⌉+1 i=1 i=⌈m2 /2⌉+1 0 ≤ x1 ≤ . . . ≤ xn ≤ 1. (4) Further, guided by the symmetry argument, we can reduce the number of variables (and also, the number of constraints) in problem (4) by half. Indeed, suppose, ξ ′ = (x′1 , . . . , x′n ) is an optimal solution of (4). Then, by symmetry, ξ ′′ = (1 − x′n , . . . , 1 − x′1 ) is an optimal solution of (4) as well. Convexity of the optimal set2 of (4) implies that ξ = (ξ ′ + ξ ′′ )/2, each whose entry is defined by the formula xi = (1 + x′i − x′n+1−i )/2 is also an optimal solution. Since xi + xn+1−i = 1, hereinafter, we reduce the number of variables to ⌊n/2⌋. Moreover, for odd n, x⌈n/2⌉ = 1/2. To show that B(n, 2) ≤ n/6, we study all cases for (n mod 6). Case n = 6t: Consider the constraint of (4) defined by m1 = 2t and m2 = 4t. ∑ t ∑ 2t ∑ 3t ∑ 3t ∑ 2t − xi + xi − xi − (1 − xi ) + (1 − xi ) ≥ u, i=1 i=t+1 i=2t+1 i=2t+1 i=1 ∑t which is equivalent to u + 2 i=1 xi ≤ t. Since all xi ≥ 0, u ≤ t = n/6, and we are done. Case n = 6t + 1: Here, we consider two constraints of (4), defined by m1 = 2t, m2 = 4t + 1 and m1 = 2t + 1, m2 = 4t, respectively. They are ∑t ∑2t ∑3t 1 ∑3t ∑2t − xi + xi − xi − − (1 − xi ) + (1 − xi ) ≥ u i=1 i=t+1 i=2t+1 2 i=2t+2 i=1 2 The set of optimal solutions 324 and ∑ t ∑ 2t+1 ∑ 3t 1 ∑ 3t ∑2t − xi + xi − xi − − (1 − xi ) + (1 − xi ) ≥ u. i=1 i=t+2 i=2t+2 2 i=2t+1 i=1 After the equivalent transformation, we obtain the subsystem   ∑ t  u + 2 xi + x2t+1 ≤ t + 12 i=1   ∑ t u + 2 xi + xt+1 − 2x2t+1 ≤ t − 21 , i=1 which implies ∑ t 3u + 6 xi + xt+1 ≤ 3t + 1/2 and u ≤ t + 1/6 = n/6. i=1 In case n = 6t + 2 we take constraints defined by m1 = 2t + 1, m2 = 4t + 1 and m1 = 2t, m2 = 4t + 2: ∑ t ∑ 2t+1 ∑ 3t+1 ∑ 3t+1 ∑ 2t − xi + xi − xi − (1 − xi ) + (1 − xi ) ≥ u i=1 i=t+2 i=2t+2 i=2t+2 i=1 ∑ t ∑ 2t ∑ 3t+1 ∑ 3t+1 ∑ 2t+1 − xi + xi − xi − (1 − xi ) + (1 − xi ) ≥ u. i=1 i=t+1 i=2t+1 i=2t+2 i=1 Transformed   ∑ t  u + 2 xi − x2t+1 ≤ t i=1   ∑ t u + 2 xi + 2x2t+1 ≤ t + 1, i=1 they imply ∑ t 3u + 6 xi ≤ 3t + 1 i.e. u ≤ t + 1/3 = n/6. i=1 Case n = 6t + 3 is similar to the case n = 6t. Here, to obtain the desired bound, it is enough to consider the single constraint defined by m1 = 2t + 1 and m2 = 4t + 2 ∑ t ∑ 2t+1 1 ∑∑ 3t+1 3t+1 ∑ 2t+1 − xi + xi − xi − − (1 − xi ) + (1 − xi ) ≥ u. (5) i=1 i=t+2 i=2t+2 2 i=2t+2 i=1 Being transformed, (5) becomes ∑ t u+2 xi + xt+1 ≤ t + 1/2, i=1 which implies u ≤ t + 1/2 = n/6. In case n = 6t + 4 we convolve again two appropriate constraints defined by m1 = 2t + 1, m2 = 4t + 3 and m1 = 2t + 2, m2 = 4t + 2 ∑ t ∑ 2t+1 ∑ 3t+2 ∑ 3t+2 ∑ 2t+1 − xi + xi − xi − (1 − xi ) + (1 − xi ) ≥ u i=1 i=t+2 i=2t+2 i=2t+3 i=1 ∑ t+1 ∑ 2t+2 ∑ 3t+2 ∑ 3t+2 ∑ 2t+1 − xi + xi − xi − (1 − xi ) + (1 − xi ) ≥ u, i=1 i=l+2 i=2t+3 i=2t+2 i=1 325 which, after the equivalent transformation give the subsystem   ∑ t  u + 2 xi + x2t+2 ≤ t + 1 i=1   ∑ t+1 u + 2 xi − 2x2t+2 ≤ t i=1 implying ∑ t 3u + 6 xi + 2xt+1 ≤ 3t + 2 i.e. u ≤ t + 2/3 = n/6. i=1 Finally, in case n = 6t + 5 transforming the constraints defined by m1 = 2t + 2, m2 = 4t + 3 and m1 = 2t + 1, m2 = 4t + 4 ∑ t+1 ∑ 2t+2 ∑ 3t+2 1 ∑ 3t+2 ∑ 2t+1 − xi + xi − xi − − (1 − xi ) + (1 − xi ) ≥ u i=1 i=t+2 i=2t+3 2 i=2t+3 i=1 ∑ t ∑ 2t+1 ∑ 3t+2 1 ∑ 3t+2 ∑ 2t+2 − xi + xi − xi − − (1 − xi ) + (1 − xi ) ≥ u i=1 i=t+2 i=2t+2 2 i=2t+3 i=1 we obtain the subsystem   ∑ t+1  u + 2 xi − x2t+2 ≤ t + 12 i=1   ∑ t u + 2 xi + xt+1 + x2t+2 ≤ t + 32 , i=1 which, being convolved, gives us ∑ t 3u + 6 xi + 5xt+1 ≤ 3t + 5/2 =⇒ u ≤ t + 5/6 = n/6. i=1 Thus, we completely proved point (i) of Theorem for the case of k = 2. 3.3 Attainability Now, we show that for any n ≥ 12 inequality (2) is tight. Consider the following configuration given by locations p1 , . . . , p5 Figure 2: The configuration Place n = 4⌊ n4 ⌋ + { n4 } points at the locations p1 , . . . , p5 with multiplicities presented at Fig. 3 Figure 3: Placing the points Since n ≥ 12, the multiplicities of points located at p1 , p2 , p4 , and p5 are at least 3 and at most 3 points are located at p3 . By the symmetry of the sample obtained, there are two best options to partition it into two clusters C1 = {1, . . . , ⌊n/4⌋}, C2 = {⌊n/4⌋ + 1, . . . , n} and C1 = {1, . . . , 2⌊n/4⌋}, C2 = {2⌊n/4⌋ + 1, . . . , n} (see Fig.4). 326 Figure 4: Two ways of possible clustering Let us calculate the cost F (ξ, σ) for each option. In the first case ∑ F (ξ, σ) = |xi − c2 |, i∈C2 where c2 = p4 (since n > 12). Therefore, ⌊n⌋ 1 {n} 1 ⌊n⌋ 1 ⌊n⌋ 2 {n} 1 ⌊n⌋ {n} 4 4 + 4 n F (ξ, σ) = + + = + = = . 4 3 4 6 4 3 4 3 4 6 6 6 Consider the second case. Here, again c2 = p4 . Therefore, ⌊n⌋ 1 {n} 1 ⌊n⌋ 1 n F (ξ, σ) = + + = , 4 3 4 6 4 3 6 i.e. Theorem is completely proved so as point (ii). Acknowledgements This research was supported by Russian Foundation for Basic Research, projects no. 16-07-00266 and no. 17-08- 01385. References [Ben-David, 2015] Ben-David, S. (2015). Computational feasibility of clustering under clusterability assumptions. CoRR, abs/1501.00437. [Grønlund et al., 2017] Grønlund, A., Larsen, K. G., Mathiasen, A., and Nielsen, J. S. (2017). Fast exact k- means, k-medians and bregman divergence clustering in 1d. CoRR, abs/1701.07204. [Guruswami and Indyk, 2003] Guruswami, V. and Indyk, P. (2003). Embeddings and non-approximability of geometric problems. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 537–538, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics. [Har-Peled and Mazumdar, 2004] Har-Peled, S. and Mazumdar, S. (2004). On coresets for k-means and k-median clustering. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pages 291–300, New York, NY, USA. ACM. [Khachai and Neznakhina, 2017] Khachai, M. and Neznakhina, E. (2017). Solvability of the Generalized Travel- ing Salesman Problem in the class of quasi- and pseudo-pyramidal tours (in Russian). Trudy Inst. Matematiki i Mechaniki UrO RAN, 23(3):280–291. [Kumar et al., 2010] Kumar, A., Sabharwal, Y., and Sen, S. (2010). Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1–5:32. 327